BMC Bioinformatics. 2013 Jan 23;14:28.

doi: 10.1186/1471-2105-14-28.

A fast least-squares algorithm for population inference

R Mitchell Parry¹, May D Wang

Affiliation

¹ The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA.

PMID: 23343408
PMCID: PMC3602075
DOI: 10.1186/1471-2105-14-28

Abstract

Background: Population inference is an important problem in genetics. It is used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual's genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters P and Q of this binomial likelihood model can be inferred using slow sampling methods such as Markov chain Monte Carlo or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model, motivated by a Euclidean interpretation of the genotype feature space. The result is a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.
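The model described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes each genotype entry counts copies of one allele (0, 1, or 2) and is drawn as Binomial(2, m), where m is the membership-weighted mix of the population allele frequencies. The function names and the layout of P (one row per population, one column per locus) are hypothetical.

```python
import math
import random


def mixed_frequencies(q, P):
    """Allele frequency at each locus for one individual: m_l = sum_k q_k * P[k][l]."""
    return [sum(q_k * P[k][l] for k, q_k in enumerate(q)) for l in range(len(P[0]))]


def simulate_genotype(q, P, rng):
    """Draw g_l ~ Binomial(2, m_l), the count of one allele at each locus (assumed model)."""
    return [sum(rng.random() < m for _ in range(2)) for m in mixed_frequencies(q, P)]


def binomial_log_likelihood(g, q, P):
    """Binomial log-likelihood of genotype g, dropping the constant binomial coefficient."""
    return sum(
        g_l * math.log(m) + (2 - g_l) * math.log(1.0 - m)
        for g_l, m in zip(g, mixed_frequencies(q, P))
    )


def squared_error(g, q, P):
    """Least-squares surrogate: squared distance between g and its expectation 2 * m."""
    return sum((g_l - 2.0 * m) ** 2 for g_l, m in zip(g, mixed_frequencies(q, P)))


rng = random.Random(0)
# K = 2 populations, 200 loci; frequencies kept away from 0 and 1 so the logs are finite.
P = [[rng.uniform(0.05, 0.95) for _ in range(200)] for _ in range(2)]
q_true = [0.7, 0.3]
g = simulate_genotype(q_true, P, rng)

print(binomial_log_likelihood(g, q_true, P), squared_error(g, q_true, P))
```

Both objectives score the same candidate (P, Q) against the same genotypes. The squared error treats the genotype vector as a point in Euclidean space, which is the interpretation the abstract uses to motivate the simplification.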

Results: We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5 to 6 times faster.
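The claim about the expected value can be illustrated with a small simulation. The sketch below rests on several assumptions that are not taken from the paper: K = 2 populations, P treated as known (the "part of the problem" already solved), genotypes drawn as Binomial(2, m), and only the sum-to-one constraint on Q enforced, with no clipping to [0, 1]. Under those assumptions the least-squares estimate is linear in the genotypes, so its average over many simulated datasets should sit close to the true membership. All names are hypothetical.

```python
import random


def estimate_q1(g, p1, p2):
    """Least-squares estimate of q1 for K = 2 with q2 = 1 - q1 and P known.

    Minimizes sum_l (g_l / 2 - q1 * p1_l - (1 - q1) * p2_l)^2 in closed form.
    No clipping to [0, 1] is applied, so the estimate stays linear in g.
    """
    d = [a - b for a, b in zip(p1, p2)]
    num = sum((g_l / 2.0 - b) * d_l for g_l, b, d_l in zip(g, p2, d))
    return num / sum(d_l * d_l for d_l in d)


def simulate_genotype(q1, p1, p2, rng):
    """Draw g_l ~ Binomial(2, m_l) with m_l = q1 * p1_l + (1 - q1) * p2_l."""
    ms = [q1 * a + (1.0 - q1) * b for a, b in zip(p1, p2)]
    return [sum(rng.random() < m for _ in range(2)) for m in ms]


rng = random.Random(1)
n_loci, q1_true = 500, 0.7
p1 = [rng.uniform(0.05, 0.95) for _ in range(n_loci)]
p2 = [rng.uniform(0.05, 0.95) for _ in range(n_loci)]

# Average the estimate over many independently simulated genotype vectors.
estimates = [
    estimate_q1(simulate_genotype(q1_true, p1, p2, rng), p1, p2)
    for _ in range(200)
]
mean_estimate = sum(estimates) / len(estimates)
print(mean_estimate)
```

Increasing `n_loci` should tighten the spread of the individual estimates, which is one way to read the statement that the variance shrinks as the problem grows.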

Conclusions: The computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance among the algorithms decreases. In addition, when prior information is available, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.

Figures

**Figure 1**
**Bound on total variance.** Solid and dashed lines correspond to the empirical estimate of the total variance and the upper bound for total variance, respectively.

**Figure 2**
**Precision of best-case scenario for estimating P.** Root mean squared error for different values of p using (a) *Admixture*’s Sequential Quadratic Programming or (b) the least-squares approximation.

**Figure 3**
**Precision of best-case scenario for estimating Q.** Solid and dashed lines correspond to *Admixture*’s Sequential Quadratic Programming optimization and the least-squares approximation, respectively.

**Figure 4**
**Computational timing comparison.** Box plots show the median (red line) and inter-quartile range (blue box) for computation time on a logarithmic scale using (a) N=1000, α=0.5, and varying K; (b) K=4, α=0.5, and varying N; and (c) K=4, N=1000, and varying α.

**Figure 5**
**Comparison on HapMap Phase 3 dataset.** Inferred population membership proportions using (a) *Admixture* and (b) least-squares with α=1. Each point represents a different individual among the four populations: ASW, CEU, MEX, and YRI. The axes represent the proportion of each individual’s genome originating from each inferred population. The proportion belonging to the third inferred population is given by q₃ = 1 – q₁ – q₂.

**Figure 6**
**First-order approximation for slope of log-likelihood of m.** Solid and dashed lines correspond to the true and approximated slope, respectively. The red, green, and blue lines correspond to g = 0, g = 1, and g = 2, respectively.

**Figure 7**
**First-order approximation for slope of log-likelihood of q.** Solid and dashed lines correspond to the true and approximated slope, respectively, for K = 2. The blue, green, red, and orange lines correspond to α = 0.1, α = 0.5, α = 1, and α = 2, respectively.

Cited by

Fast and efficient estimation of individual ancestry coefficients.
Frichot E, Mathieu F, Trouillon T, Bouchard G, François O. Genetics. 2014 Apr;196(4):973-83. doi: 10.1534/genetics.113.160572. Epub 2014 Feb 4. PMID: 24496008. Free PMC article.
