BMC Bioinformatics. 2011 Jun 18:12:246.

doi: 10.1186/1471-2105-12-246.

Enhancements to the ADMIXTURE algorithm for individual ancestry estimation

David H Alexander¹, Kenneth Lange

PMID: 21682921
PMCID: PMC3146885
DOI: 10.1186/1471-2105-12-246

David H Alexander et al. BMC Bioinformatics. 2011.

Affiliation

¹ Department of Biomathematics, UCLA, Los Angeles, California, USA. dalexander@ucla.edu

Abstract

Background: The estimation of individual ancestry from genetic data has become essential to applied population genetics and genetic epidemiology. Software programs for calculating ancestry estimates have become essential tools in the geneticist's analytic arsenal.

Results: Here we describe four enhancements to ADMIXTURE, a high-performance tool for estimating individual ancestries and population allele frequencies from SNP (single nucleotide polymorphism) data. First, ADMIXTURE can be used to estimate the number of underlying populations through cross-validation. Second, individuals of known ancestry can be exploited in supervised learning to yield more precise ancestry estimates. Third, by penalizing small admixture coefficients for each individual, one can encourage model parsimony, often yielding more interpretable results for small datasets or datasets with large numbers of ancestral populations. Finally, by exploiting multiple processors, large datasets can be analyzed even more rapidly.

Conclusions: The enhancements we have described make ADMIXTURE a more accurate, efficient, and versatile tool for ancestry estimation.
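The penalized estimation described in the Results can be made concrete with a small sketch. It assumes the standard binomial admixture model, in which the genotype of individual i at SNP j is a count of 0, 1, or 2 copies of an allele with success probability Σ_k q_ik f_kj. The abstract does not give the penalty itself, so the log-shaped penalty below, its parameter `eps`, and all function names are illustrative assumptions rather than ADMIXTURE's actual implementation.

```python
import math


def log_likelihood(G, Q, F):
    """Binomial admixture log-likelihood, up to an additive constant.

    G[i][j]: allele count (0, 1, or 2) for individual i at SNP j;
             None marks a missing genotype and is skipped.
    Q[i][k]: ancestry fraction of individual i from population k.
    F[k][j]: allele frequency in population k at SNP j,
             assumed strictly between 0 and 1 so the logs are finite.
    """
    total = 0.0
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            if g is None:
                continue
            # Mixture allele frequency for this individual at this SNP.
            p = sum(Q[i][k] * F[k][j] for k in range(len(F)))
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def sparsity_penalty(Q, eps=1e-2):
    """Assumed penalty: 0 at q = 0, 1 at q = 1, steep near 0.

    Because the curve is concave, spreading ancestry over several small
    coefficients costs more than concentrating it in a few populations.
    """
    scale = math.log(1.0 + 1.0 / eps)
    return sum(math.log(1.0 + q / eps) / scale for row in Q for q in row)


def penalized_objective(G, Q, F, lam, eps=1e-2):
    """Objective to maximize: log-likelihood minus lam times the penalty."""
    return log_likelihood(G, Q, F) - lam * sparsity_penalty(Q, eps)
```

With `lam = 0` the objective reduces to the ordinary log-likelihood; larger values of `lam` push small admixture coefficients toward zero, which is the parsimony effect the abstract describes.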

Figures

**Figure 1**
**Cross-validation (CV) of three datasets derived from the HapMap 3 resource using v = 5 folds**. After subsampling 13,928 markers to minimize linkage disequilibrium, we separately cross-validated datasets containing unrelated individuals from the (a) CEU, (b) CEU, ASW, and YRI, and (c) CEU, ASW, YRI, and MEX HapMap 3 subsamples. Plots display CV error versus K. CV for the CEU dataset suggests K = 1 is the best fit, agreeing with intuition; K = 2 is the best fit for the CEU+ASW+YRI dataset, which contains European, African, and admixed African-American samples; K = 3 is the best fit for CEU+ASW+YRI+MEX, which additionally contains Mexican-Americans.
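The bookkeeping behind a plot of CV error versus K can be sketched as follows. This is a minimal illustration, assuming that the hold-out unit is a single genotype entry and that a held-out entry is marked as missing with `None`; the function names are hypothetical, and the model fitting and error calculation that would sit between masking and selection are omitted.

```python
import random


def assign_folds(n_individuals, n_markers, v=5, seed=0):
    """Randomly assign every genotype entry (i, j) to exactly one of v folds."""
    entries = [(i, j) for i in range(n_individuals) for j in range(n_markers)]
    rng = random.Random(seed)
    rng.shuffle(entries)
    # Striding through the shuffled list gives v disjoint, near-equal folds.
    return [entries[f::v] for f in range(v)]


def mask_fold(G, fold):
    """Return a copy of G with the entries in `fold` set to None (held out)."""
    masked = [list(row) for row in G]
    for i, j in fold:
        masked[i][j] = None
    return masked


def select_k(cv_error_by_k):
    """Return the K whose cross-validation error is smallest."""
    return min(cv_error_by_k, key=cv_error_by_k.get)
```

In use, one would fit the model on each masked matrix for every candidate K, score the predictions on the held-out entries, average the error over the v folds, and pass the resulting mapping from K to error to `select_k`.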

**Figure 2**
**Errors in estimating ancestral allele frequencies lead to bias in estimating ancestry fractions (Q), with many individuals ascribed too much admixture**. The plot shows an estimate of the relationship between the true ancestry fraction *q_i*₁ (fraction of ancestry attributed to population 1) and the resulting estimate as determined via a nonparametric regression (LOESS) model fitted to the results from analyses of 100 simulated datasets. Reference individuals are excluded from the plots and regression analyses. The dotted line y = x is tracked closely by the conditional mean of supervised estimates, suggesting little bias. However, in panel (a) (simulations with *F_ST* = 0.01) the conditional mean of the unsupervised estimates deviates substantially, exhibiting an upward bias for low *q_i*₁ and a downward bias for high *q_i*₁. The bias is mitigated using simulations with *F_ST* = 0.05, as shown in panel (b), or by using a larger number of markers (J = 300,000, not shown).

**Figure 3**
**Penalized estimation can reduce the bias in ancestry estimates that appears for small marker sets or closely related ancestral populations**. We applied penalized estimation to the simulated dataset of 10,000 SNP markers from admixed individuals from two populations differentiated by *F_ST* = 0.01. Panel (a) shows that 5-fold cross-validation selects λ = 5 as the optimal strength of penalization. The results of penalization with λ = 5 are compared, in panel (b), with the maximum likelihood (unsupervised) estimates and with the supervised estimates, all visualized via nonparametric regression as in Figure 2. Reference individuals are excluded from the regression models.
