PLoS Genet. 2015 Apr 7;11(4):e1004969.
doi: 10.1371/journal.pgen.1004969. eCollection 2015 Apr.

Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model


Gerhard Moser et al. PLoS Genet.

Abstract

Gene discovery, estimation of heritability captured by SNP arrays, inference on genetic architecture and prediction analyses of complex traits are usually performed using different statistical models and methods, leading to inefficiency and loss of power. Here we use a Bayesian mixture model that simultaneously allows variant discovery, estimation of genetic variance explained by all variants and prediction of unobserved phenotypes in new samples. We apply the method to simulated data of quantitative traits and Wellcome Trust Case Control Consortium (WTCCC) data on disease and show that it provides accurate estimates of SNP-based heritability, produces unbiased estimators of risk in new samples, and can estimate genetic architecture by partitioning variation across hundreds to thousands of SNPs. We estimated that, depending on the trait, 2,633 to 9,411 SNPs explain all of the SNP-based heritability in the WTCCC diseases. The majority of those SNPs (>96%) had small effects, confirming a substantial polygenic component to common diseases. The proportion of the SNP-based variance explained by large effects (each SNP explaining 1% of the variance) varied markedly between diseases, ranging from almost zero for bipolar disorder to 72% for type 1 diabetes. Prediction analyses demonstrate that for diseases with major loci, such as type 1 diabetes and rheumatoid arthritis, Bayesian methods outperform profile scoring or mixed model approaches.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Comparison of causal variant identification accuracy of BayesR, BSLMM, LMM and single-SNP analysis in simulated data.
Shown is the true positive rate as a function of false positive rate for correct identification of regions (250 kb) containing causative SNPs. Simulations are based on real SNP data of 3,924 individuals genotyped for 287,854 SNPs. The total number of causative SNPs was 3,000 with 10 (solid line), 310 (dotted line) and 2,680 effects sampled from a zero mean normal distribution with variance 10⁻², 10⁻³, and 10⁻⁴, respectively. Trait heritabilities (h²) were 0.2, 0.5 and 0.8.
Fig 2. Comparison of performance of BayesR, BSLMM, LMM and GPRS in simulated data.
(A) Distribution of SNP-based heritability estimates. The horizontal lines indicate the true heritability. GPRS does not provide estimates of heritability. (B) Distribution of the correlation coefficient between true and predicted phenotype. Simulations are based on real SNP data of 3,924 individuals genotyped for 287,854 SNPs. The total number of causative SNPs was 3,000 with 10, 310 and 2,680 effects sampled from a zero mean normal distribution with variance 10⁻², 10⁻³, and 10⁻⁴, respectively. Trait heritabilities (h²) were 0.2, 0.5 and 0.8. The single boxplots display the variation in estimates among 50 replicates.
Fig 3. Genetic architecture inferred using BayesR and BSLMM in simulated data.
Shown is the proportion of total genetic variance explained by each mixture component for BayesR and the relative contribution of SNPs with an effect above the polygenic component for BSLMM. Simulations are based on real SNP data of 3,924 individuals genotyped for 287,854 SNPs. The total number of causative SNPs was 3,000 with 10, 310, and 2,680 effects sampled from a zero mean normal distribution with variance 10⁻², 10⁻³, and 10⁻⁴, respectively. The horizontal lines indicate the expected contribution by the 10 (right), 310 (middle) and 2,680 (left) SNPs to the genetic variance. Trait heritabilities (h²) were 0.2, 0.5 and 0.8. The single boxplots display the variation in estimates among 50 replicates.
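The expected contribution of each effect-size group can be worked out from the counts and variances given in the caption. The sketch below is illustrative only: it assumes standardized, uncorrelated genotypes, so that each SNP contributes its effect variance to the genetic variance, and the function names are hypothetical. Under that assumption the three groups are expected to contribute about 0.15, 0.46 and 0.40 of the genetic variance; the horizontal lines in the figure may have been computed differently (for example, from the realized effects and genotypes).

```python
import random

# Effect-size groups from the caption: (number of SNPs, variance of effects).
GROUPS = [(10, 1e-2), (310, 1e-3), (2680, 1e-4)]


def expected_shares(groups):
    """Expected share of genetic variance per group.

    Assumption (not from the source): genotypes are standardized and
    uncorrelated, so a group contributes n_snps * effect_variance.
    """
    totals = [n * var for n, var in groups]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def sample_effects(groups, seed=0):
    """Draw one set of effects per group from N(0, var)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, var ** 0.5) for _ in range(n)] for n, var in groups]


def realized_shares(effects):
    """Share of the summed squared effects falling in each group."""
    sums = [sum(b * b for b in group) for group in effects]
    total = sum(sums)
    return [s / total for s in sums]


shares = expected_shares(GROUPS)
effects = sample_effects(GROUPS)
realized = realized_shares(effects)
```

In a single replicate the realized shares scatter around the expected ones, most visibly for the group of 10 large effects, which is one reason boxplots over replicates are informative here.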
Fig 4. Comparison of performance of BayesR, BSLMM, LMM and GPRS in WTCCC data.
(A) Estimates of SNP-based heritability on the observed scale. Error bars are standard deviations of posterior samples for BayesR and BSLMM or standard errors for LMM. GPRS does not provide estimates of heritability. (B) Distribution of the area under the curve (AUC). The single boxplots display the variation in estimates among 20 replicates. In each replicate, the data set was randomly split into a training sample containing 80% of individuals and a validation sample containing the remaining 20%.
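The evaluation scheme in panel (B) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names are hypothetical, the split is a plain random shuffle (the source does not say whether cases and controls were stratified), and AUC is computed directly as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control.

```python
import random


def split_indices(n, train_fraction=0.8, seed=0):
    """Randomly split indices 0..n-1 into training and validation sets."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cases, 0 for controls; scores: predicted risk.
    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


train, valid = split_indices(100)
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Repeating the split with different seeds (20 times in the caption) and recomputing AUC on each validation set gives the distribution summarized by each boxplot. The pairwise loop is quadratic in sample size; a rank-based formula would be preferable for large validation sets.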
Fig 5. Genetic architecture underlying seven traits in WTCCC inferred using BayesR.
Proportion of additive genetic variation contributed by SNPs with different effect sizes. The colored bars partition the genetic variance in contributions from each mixture class. The proportion of variance in each mixture component was calculated as the sum of the square of the sampled effect sizes of the SNPs allocated to each component divided by the sum of the total variance explained by SNPs.
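The calculation described in this caption can be sketched for a single posterior sample as below. This is an assumption-laden illustration: the function name and inputs are hypothetical, the denominator is taken to be the sum of squared sampled effects over all SNPs (which matches "total variance explained by SNPs" only if genotypes are standardized), and averaging over posterior samples is left to the caller.

```python
def component_variance_shares(effects, components, n_components):
    """Proportion of summed squared effects falling in each mixture component.

    effects:    sampled effect size of each SNP in one posterior sample
    components: index (0-based) of the mixture component each SNP is
                allocated to in that sample
    """
    sums = [0.0] * n_components
    for effect, k in zip(effects, components):
        sums[k] += effect * effect
    total = sum(sums)
    if total == 0.0:
        # No SNP has a non-zero effect in this sample.
        return sums
    return [s / total for s in sums]


# Toy sample: one SNP in the zero-effect class, one small effect, two larger.
toy_shares = component_variance_shares(
    effects=[0.0, 0.1, -0.2, 0.2],
    components=[0, 1, 2, 2],
    n_components=3,
)
```

The same function applied to the SNPs of one chromosome at a time would give a per-chromosome partition of the kind shown in Fig 6.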
Fig 6. Proportion of genetic variance on each chromosome explained by SNPs with different effect sizes underlying seven traits in WTCCC.
Proportion of additive genetic variation contributed by individual chromosomes and the proportion of variance on each chromosome explained by SNPs with different effect sizes. For each chromosome we calculated the proportion of variance in each mixture component as the sum of the square of the sampled effect sizes of the SNPs allocated to each component divided by the sum of the total variance explained by SNPs. The colored bars partition the genetic variance in contributions from each mixture class.

