Nat Genet. 2012 Jun 17;44(7):825-30.
doi: 10.1038/ng.2314.

An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations

Vincent Segura et al. Nat Genet.

Abstract

Population structure causes genome-wide linkage disequilibrium between unlinked loci, leading to statistical confounding in genome-wide association studies. Mixed models have been shown to handle the confounding effects of a diffuse background of large numbers of loci of small effect well, but they do not always account for loci of larger effect. Here we propose a multi-locus mixed model as a general method for mapping complex traits in structured populations. Simulations suggest that our method outperforms existing methods in terms of power as well as false discovery rate. We apply our method to human and Arabidopsis thaliana data, identifying new associations and evidence for allelic heterogeneity. We also show how a priori knowledge from an A. thaliana linkage mapping study can be integrated into our method using a Bayesian approach. Our implementation is computationally efficient, making the analysis of large data sets (n > 10,000) practicable.

Figures

Figure 1
A GWAS for a simulated trait with two causal SNPs (marked by vertical lines), randomly chosen from a real A. thaliana SNP dataset. Random error was added to the trait to fix the heritability at 25%. (a) A single-SNP linear regression scan detects four significantly associated SNPs (marked in red) at a Bonferroni-corrected threshold of 0.05 (dashed horizontal line). Half of these SNPs are false positives and the other half true positives, giving a false discovery rate (FDR) of 50% and a power of 100%. (b) A single-SNP mixed-model scan eliminates one false positive but also one true positive, leaving the FDR unchanged at 50% while decreasing the power to 50%. (c) Adding the most significant SNP as a cofactor to the mixed model (marked in orange) recovers the second causal SNP while eliminating the last false positive, giving the ideal case of an FDR of 0% and a power of 100%.
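The FDR and power figures quoted for the three panels follow from simple counts. The sketch below reproduces that arithmetic; the function name `fdr_and_power` is hypothetical, and the counts for panel (c) (two true positives, no false positives) are inferred from the stated 0% FDR and 100% power rather than given explicitly in the caption.

```python
def fdr_and_power(true_positives, false_positives, n_causal):
    """Return (FDR, power) from counts of significant SNPs.

    FDR   = false positives / all significant SNPs
    power = detected causal SNPs / all causal SNPs
    """
    n_significant = true_positives + false_positives
    # With no significant SNPs there are no discoveries, so report FDR as 0.
    fdr = false_positives / n_significant if n_significant else 0.0
    power = true_positives / n_causal
    return fdr, power


# (true positives, false positives) per panel; two causal SNPs in total.
# Panel (c) counts are an assumption inferred from the caption.
panels = {"a": (2, 2), "b": (1, 1), "c": (2, 0)}

results = {
    name: fdr_and_power(tp, fp, n_causal=2)
    for name, (tp, fp) in panels.items()
}

for name, (fdr, power) in results.items():
    print(f"panel ({name}): FDR = {fdr:.0%}, power = {power:.0%}")
```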
Figure 2
Power and false discovery rate (FDR) in the 100-locus model simulations for four different mapping methods: linear model (LM), stepwise linear model (SWLM), mixed model (MM), and multi-locus mixed model (MLMM). For the purpose of computing power and FDR, a causal SNP was considered detected if a SNP within 25 kb on either side was declared significant (results for other window sizes are given in Supplementary Fig. 3), and only causal SNPs that were in principle detectable (i.e., that were marginally significant at a Bonferroni-corrected threshold of 0.05 in a simple linear model) were considered. For clarity, only the backward path of the multi-locus methods (SWLM and MLMM) is shown; a comparison between forward and backward paths is given in Supplementary Fig. 4. Circles and triangles denote the best-fitting model according to the Bonferroni and EBIC model-selection criteria, respectively. Three phenotypic heritabilities were used in the simulations: 0.25 (a, d), 0.50 (b, e), and 0.75 (c, f). Power and FDR were estimated with (a–c) and without (d–f) the causal loci included.
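The window-based detection rule described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `window_power_fdr` and the example positions are hypothetical, all SNPs are assumed to lie on a single chromosome, and counting a significant SNP as a false positive when it lies outside the window of every causal SNP is an assumption about how FDR was scored.

```python
def window_power_fdr(causal_pos, significant_pos, window_bp=25_000):
    """Score a scan against known causal SNP positions (in base pairs).

    A causal SNP counts as detected if any significant SNP lies within
    `window_bp` on either side of it. A significant SNP counts as a false
    positive if it lies outside the window of every causal SNP (assumption).
    """
    detected = [
        c for c in causal_pos
        if any(abs(s - c) <= window_bp for s in significant_pos)
    ]
    false_hits = [
        s for s in significant_pos
        if all(abs(s - c) > window_bp for c in causal_pos)
    ]
    power = len(detected) / len(causal_pos) if causal_pos else 0.0
    fdr = len(false_hits) / len(significant_pos) if significant_pos else 0.0
    return power, fdr


# Hypothetical positions: one hit 10 kb from the first causal SNP,
# one hit 200 kb away from both causal SNPs.
causal = [100_000, 500_000]
significant = [110_000, 300_000]

power, fdr = window_power_fdr(causal, significant)
print(f"power = {power:.2f}, FDR = {fdr:.2f}")
```

Widening `window_bp` makes detection more lenient, which is presumably why the caption points to results for other window sizes.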
Figure 3
GWAS for low-density lipoprotein (LDL) in the NFBC1966 dataset. (a) A single-locus mixed model identifies seven SNPs in three genes (marked in red; Bonferroni-corrected threshold of 0.05; dashed horizontal line). (b) A multi-locus mixed model (MLMM) identifies five SNPs in four genes (marked in orange, and numbered in the order they were included in the model). (c) Partition of variance at each step of MLMM (10 forward and 10 backward) into variance explained by: the SNPs included in the model (blue); kinship (green); and noise (yellow).
Figure 4
GWAS for Na+ accumulation in A. thaliana. (a) A single-locus mixed model identifies a strong peak of significantly associated SNPs on chromosome 4 (marked in red; Bonferroni-corrected threshold of 0.05; dashed horizontal line). (b) A multi-locus mixed model (MLMM) identifies three SNPs (marked in orange, and numbered in the order they were included in the model). (c) Partition of variance at each step of MLMM (8 forward and 8 backward) into variance explained by: the SNPs included in the model (blue); kinship (green); and noise (yellow).
Figure 5
An example of a Bayesian multi-locus mixed model (MLMM) for the analysis of FLOWERING LOCUS C (FLC) expression in A. thaliana. (a) An approximate mixed-model scan for FLC expression, marking the FRIGIDA gene with a vertical grey line. (b) The posterior probability of association scan after the Bayesian MLMM has included two loci in the model, which incidentally are the two previously identified causative indels. (c) Partition of phenotypic variance for each forward inclusion (10 steps) and backward elimination (10 steps after the dotted line). The vertical red line marks the model containing the two causative indels.

