Abstract
Genome-wide association studies (GWAS) have identified thousands of genetic risk variants. However, these variants explain relatively little of the estimated heritability of most complex diseases. The 1000 Genomes Project provides a reference panel for imputing missing genotypes in existing GWAS data. Imputation-based GWAS can identify more association signals on a genome-wide scale, and these new markers are potential sources of missing heritability.
In this study, we performed genotype imputation on the Wellcome Trust Case Control Consortium (WTCCC) Phase I genotype data using the 1000 Genomes data as the reference panel. We then estimated the phenotypic variance explained by all significant association signals. The proportion of phenotypic variance explained by genetic variants increased substantially when the new associated variants identified through 1000 Genomes-based imputation were included. These results are consistent with the hypothesis that a large number of variants that are yet to be identified are potential sources of missing heritability.
Figure
BD: bipolar disorder; CAD: coronary artery disease; CD: Crohn’s disease; RA: rheumatoid arthritis; T1D: type 1 diabetes; T2D: type 2 diabetes; K: prevalence; VE: explained variance; SE: standard error.
a: Genome-wide association analysis without imputation. b: Genome-wide association analysis with imputation using 1000 Genomes data as the reference panel. c: The number of SNPs with p-value less than 1×10⁻⁸. d: The estimate of phenotypic variance explained by SNPs with p-value less than 1×10⁻⁸. The values in parentheses are the standard errors of the explained phenotypic variance.
Supplementary Figure 1. Estimate of the phenotypic variance explained by SNPs.
Supplementary Figure 2. Estimation of the phenotypic variance explained by novel SNPs.
Introduction
Although genome-wide association studies (GWAS) have identified thousands of genetic variants that are associated with different complex diseases, a wide gap exists between the estimates of heritability and the heritability that is explained by the genetic variants found via GWAS[1]. The potential reasons for the missing heritability include myriad common variants with small effects that are yet to be found; rare variants and structural variants (insertions, deletions, duplications, inversions, translocations, and copy number variants) that are poorly detected by available genotyping arrays; and insufficient power to detect epistatic effects, parental age effects, epigenetic effects, and gene-environment (G×E) interactions.
Yang et al. reported a joint estimate over all SNPs and found that, with their method (GCTA), common SNPs can explain a large proportion of the heritability of human height[11]. Park et al. re-examined existing GWAS to estimate the number of susceptibility loci and the distribution of their effect sizes. They used these estimates to determine power and sample size requirements for future GWAS or meta-analyses[12]. Heritability on the liability scale estimated by GCTA ranged from 0.05 to 0.38 across 13 cancer types[13]. These studies argued that a large proportion of the missing heritability can be explained by common variants.
Previous studies have demonstrated that 1000 Genomes-based imputation can identify both novel and refined association loci owing to the increased density of markers[14][15]. We hypothesized that the increased density of GWAS markers would also facilitate the investigation of missing heritability without the need for additional genotyping or sequencing. We used IMPUTE2[16] for genotype imputation and then applied GCTA[17] to the association results before and after imputation to estimate the heritability of each disease.
Objective
We hypothesize that 1000 Genomes-based imputation will increase the density of GWAS markers and thereby facilitate the investigation of missing heritability without the need for additional genotyping or sequencing.
Results & Discussion
After quality control, a total of 444,167 SNPs for 16,179 individuals were retained for the initial association analysis. These SNPs were used as the input genotype data for imputation. After imputation, approximately 2.7 million SNPs were used for the association analysis of each trait. The phenotypic variance explained by all SNPs with p-value less than 1×10⁻⁸ was estimated by restricted maximum likelihood (REML) analysis, as implemented in GCTA[17].
Figure 1 shows, for the 6 traits, the number of significant SNPs and the estimate of the phenotypic variance explained by these SNPs. The number of SNPs that passed the significance threshold increased more than 10-fold after imputation. Before imputation, the significant SNPs explained at most 12.65% of the phenotypic variance. After imputation, they explained 25.52% to 56.28% of the phenotypic variance.
For BD, SNPs with p-value less than 1×10⁻⁸ after imputation explained 33.91% to 40.40% of the phenotypic variance, depending on the prevalence assumed. For T1D, the proportion of phenotypic variance explained by genetic variants was almost three times higher in the 1000 Genomes imputation-based association analysis than in the association analysis without imputation. The explained proportion of phenotypic variance increased approximately 14-fold in RA and 17-fold in CD. The increases were even larger in CAD and T2D, at about 62-fold and 95-fold, respectively.
We then defined as “novel” those SNPs whose association p-value reached the genome-wide significance level (1×10⁻⁸) after imputation but that were not in LD (r² > 0.8) with any SNP that had an association p-value less than 1×10⁻⁵ before imputation. The number of “novel” SNPs and the estimate of phenotypic variance explained by them for the 6 traits are listed in Supplementary Figure 2. The results suggest that the novel SNPs are the main reason for the increase in the heritability estimates.
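The dependence of the explained variance on the assumed prevalence can be illustrated with the standard transformation from the observed case-control scale to the liability scale. The sketch below assumes that this transformation is the one applied to the estimates reported here; the text does not state it explicitly. The function name `observed_to_liability` and the input values in the usage lines are hypothetical and are not taken from the study.

```python
from statistics import NormalDist


def observed_to_liability(ve_observed, prevalence, case_fraction):
    """Convert variance explained on the observed 0/1 scale to the liability scale.

    Assumed formula (standard correction for ascertained case-control data):
        VE_liability = VE_observed * K^2 (1 - K)^2 / (z^2 * P (1 - P))
    K: population prevalence, P: fraction of cases in the sample,
    z: standard normal density at the liability threshold for prevalence K.
    """
    if not (0.0 < prevalence < 1.0 and 0.0 < case_fraction < 1.0):
        raise ValueError("prevalence and case_fraction must lie strictly between 0 and 1")
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)  # liability threshold
    z = std_normal.pdf(threshold)                     # density at the threshold
    scale = (prevalence * (1.0 - prevalence)) ** 2 / (
        z ** 2 * case_fraction * (1.0 - case_fraction)
    )
    return ve_observed * scale


# Hypothetical inputs: the same observed-scale estimate under two assumed prevalences.
for k in (0.005, 0.01):
    print(k, observed_to_liability(0.20, prevalence=k, case_fraction=0.4))
```

For rare diseases the scaling factor grows with the assumed prevalence, so a single observed-scale estimate maps to a range of liability-scale values when several published prevalences are considered.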
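The definition of “novel” SNPs above can be written as a short selection rule. The following is a minimal sketch, not the pipeline used in this study: the function names, the dictionary-based inputs, and the pairwise r² lookup are hypothetical, and unlisted SNP pairs are assumed to have r² = 0.

```python
def make_r2_lookup(pairwise_r2):
    """Build an r2(a, b) function from a dict keyed by frozenset({a, b}).

    Assumption: pairs missing from the dict are treated as not in LD (r2 = 0).
    """
    def r2(snp_a, snp_b):
        if snp_a == snp_b:
            return 1.0
        return pairwise_r2.get(frozenset((snp_a, snp_b)), 0.0)
    return r2


def find_novel_snps(post_pvalues, pre_pvalues, r2,
                    gws_threshold=1e-8, pre_threshold=1e-5, r2_threshold=0.8):
    """Return SNPs significant after imputation and not tagged by any pre-imputation hit."""
    # SNPs that already showed association (p < 1e-5) before imputation.
    pre_hits = [snp for snp, p in pre_pvalues.items() if p < pre_threshold]
    novel = []
    for snp, p in post_pvalues.items():
        if p >= gws_threshold:
            continue  # not genome-wide significant after imputation
        if any(r2(snp, hit) > r2_threshold for hit in pre_hits):
            continue  # in LD with a signal seen before imputation
        novel.append(snp)
    return novel


# Hypothetical example with made-up SNP identifiers and p-values.
post = {"snpA": 1e-10, "snpB": 1e-9, "snpC": 1e-4}
pre = {"snpX": 1e-6, "snpY": 0.2}
ld = make_r2_lookup({frozenset(("snpA", "snpX")): 0.95})
print(find_novel_snps(post, pre, ld))
```

Passing the LD measure as a function keeps the rule independent of how r² is obtained, for example from a reference panel or from the study genotypes.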
Because most variants have relatively small effect sizes, the sample sizes of most studies are not large enough, and current genotyping technology is limited, more common variants with intermediate effects and rare variants that may have large effects are yet to be identified. These variants should be tractable through large meta-analyses and imputation-based association analyses.
This is the first study that comprehensively examined the utility of 1000 Genomes-based imputation for finding missing heritability. The proportion of phenotypic variance explained by genetic variants increased when the contribution of the newly identified variants was included. These findings support the view that a large number of variants are yet to be found and that these variants are potential sources of missing heritability.
Conclusions
The additional trait-associated variants identified through 1000 Genomes-based imputation can explain part of the missing heritability.
Limitations
One potential problem is that the heritability estimates produced by GCTA are sensitive to the chosen sample and may be biased[18]. Although 1000 Genomes-based imputation increased the proportion of phenotypic variance explained by genetic variants, a substantial proportion of heritability remains unexplained for these diseases.
Next-generation sequencing data will accelerate the exploration of missing heritability. As next-generation sequencing technology is adopted more widely, large-scale sequence data from well-phenotyped individuals will become available. These data will offer an opportunity to uncover the missing heritability attributable to variants that were not covered by current genome-wide association studies.
Methods
Genotype Data
Previously published GWAS genotype data from the WTCCC[19] were used. The 500K Affymetrix chip genotype data included 1,500 individuals from the 1958 British Birth Cohort, 1,500 individuals from the UK Blood Services controls, and about 2,000 cases for each of 6 common diseases, namely, bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D). The genotype calls that were generated by the CHIAMO algorithm were downloaded.
Quality Control
The same recommended quality control (QC) thresholds were applied to each dataset. Samples listed in the “exclusion-list” files were removed. SNPs with a value of 0 for the “good clustering” variable and SNPs listed in the “exclusion-list-snps” files in the data repository were excluded. The sample and SNP lists were downloaded from the European Genotype Archive (http://www.ebi.ac.uk/ega). Individuals with discordant sex information were excluded. Among cryptically related individuals, only the individual with the higher call rate was kept. We also excluded SNPs with a Hardy-Weinberg exact p-value less than 10⁻³ in the combined control groups. Only autosomal SNPs were used in this study. EIGENSTRAT[20] was used to correct for population stratification.
Genotype imputation
The genotypes were imputed with IMPUTE2 (version 2.2.2)[16] using the 1000 Genomes[21] phased haplotypes as the reference panel (downloaded from the IMPUTE2 webpage). The recommended standard approach with default parameters was used. First, a pre-phasing step produced best-guess haplotypes from the genotypes; in the second step, genotypes were imputed into the estimated GWAS haplotypes. Each chromosome was split into chunks of 5 Mb. To keep only high-quality genotypes, SNPs were filtered with IPGWAS[22] at an info score threshold of 0.9. SNPs with a minor allele frequency (MAF) less than 0.05 or a missing rate larger than 0.05 were excluded from the association analysis.
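The chunking and post-imputation filtering described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, intervals are taken to be 1-based and inclusive, and SNPs exactly at the info score threshold of 0.9 are assumed to be retained.

```python
def chunk_intervals(chrom_length_bp, chunk_size_bp=5_000_000):
    """Split a chromosome of the given length into consecutive 5 Mb intervals.

    Returns (start, end) pairs, 1-based and inclusive; the last chunk may be shorter.
    """
    if chrom_length_bp <= 0 or chunk_size_bp <= 0:
        raise ValueError("lengths must be positive")
    intervals = []
    start = 1
    while start <= chrom_length_bp:
        end = min(start + chunk_size_bp - 1, chrom_length_bp)
        intervals.append((start, end))
        start = end + 1
    return intervals


def passes_post_imputation_filters(info, maf, missing_rate,
                                   min_info=0.9, min_maf=0.05, max_missing=0.05):
    """Apply the info score, MAF, and missing-rate filters to one imputed SNP.

    Assumption: info == min_info is kept. MAF < 0.05 or missing rate > 0.05
    leads to exclusion, as stated in the text.
    """
    return info >= min_info and maf >= min_maf and missing_rate <= max_missing


# Hypothetical usage: a 12 Mb region and two made-up SNP records.
print(chunk_intervals(12_000_000))
print(passes_post_imputation_filters(info=0.95, maf=0.12, missing_rate=0.01))
print(passes_post_imputation_filters(info=0.85, maf=0.12, missing_rate=0.01))
```

Each interval would then be imputed separately and the per-chunk results concatenated before filtering.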
Statistical analysis
PLINK[23] was used to run the association tests. GCTA[17] was used to estimate the phenotypic variance explained by SNPs with an allelic p-value less than 1.0×10⁻⁸. We reviewed published papers to obtain the prevalence of each of the 6 common diseases.
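For readers unfamiliar with the allelic test, the sketch below computes a 1-degree-of-freedom chi-square statistic and p-value from a 2×2 table of allele counts in cases and controls. It assumes that the allelic p-value referred to above comes from this basic test; the exact PLINK options used are not stated here. The function name and the allele counts are hypothetical.

```python
import math


def allelic_chi2_test(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Rows: cases, controls. Columns: allele 1, allele 2.
    Returns (chi2, p_value).
    """
    table = [[case_a1, case_a2], [ctrl_a1, ctrl_a2]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column must contain at least one allele")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts; a SNP is selected when p < 1e-8.
chi2, p = allelic_chi2_test(1300, 2700, 1700, 4300)
print(chi2, p, p < 1e-8)
```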
Funding Statement
This work was funded by grants from NSFC (No. 81271226), the Research Grant Council of Hong Kong (HKU775208M, HKU 777212M), the Research Fund for the Control of Infectious Diseases of Hong Kong (No.11101032), and the Health and Medical Research Fund of Hong Kong Government (HMRF) (No: 01121726).
Acknowledgements
We acknowledge the WTCCC for making the data available.
Conflict Of Interest
The authors declare no conflicts of interest.
Ethics Statement
Not applicable.
No fraudulence is committed in performing these experiments or during processing of the data. We understand that in the case of fraudulence, the study can be retracted by ScienceMatters.