Heredity (Edinb). 2011 Oct;107(5):413-20.
doi: 10.1038/hdy.2011.26. Epub 2011 Mar 30.

Investigating population stratification and admixture using eigenanalysis of dense genotypes

D Shriner. Heredity (Edinb). 2011 Oct.

Abstract

Principal components analysis of genetic data serves two purposes: covariate adjustment using the top eigenvectors, which avoids inflation of type I error rates in association testing due to population stratification, and estimation of cluster or group membership independent of self-reported or ethnic identities. Eigendecomposition transforms correlated variables into an equal number of uncorrelated variables. Numerous stopping rules have been developed to identify which principal components should be retained. Recent developments in random matrix theory have led to a formal hypothesis test of the top eigenvalue, providing another way to achieve dimension reduction. In this study, I compare Velicer's minimum average partial test to a test based on the Tracy–Widom distribution as implemented in EIGENSOFT, the most widely used implementation of principal components analysis in genome-wide association analysis. In computer simulations of vicariance based on coalescent theory, EIGENSOFT systematically overestimates the number of significant principal components. Furthermore, this overestimation is larger for samples of admixed individuals than for samples of unadmixed individuals. Overestimating the number of significant principal components can lead to a loss of power in association testing by adjusting for unnecessary covariates and may lead to incorrect inferences about group differentiation. Velicer's minimum average partial test is shown to have both smaller bias and smaller variance, often with a mean squared error of 0, in estimating the number of principal components to retain. Velicer's minimum average partial test is implemented in R code and is suitable for genome-wide genotype data with or without population labels.
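
For readers unfamiliar with the minimum average partial (MAP) criterion, the following is a minimal Python sketch of the general procedure as it is usually described: remove the top m principal components from a correlation matrix, compute the average squared partial correlation among what remains, and retain the m that minimizes this average. It is not the author's R implementation. The function name velicer_map, the use of NumPy, and the simulated two-factor data are all assumptions made for illustration, and the sketch works on generic variables rather than genotype data.

```python
import numpy as np


def velicer_map(corr):
    """Sketch of a minimum average partial (MAP) criterion.

    corr: square correlation matrix (p x p).
    Returns (m, avg_sq), where avg_sq[k] is the average squared
    off-diagonal partial correlation after removing the top k
    principal components, and m is the index of the minimum.
    """
    p = corr.shape[0]
    eigvals, eigvecs = np.linalg.eigh(corr)      # ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)        # component loadings

    off_diag = ~np.eye(p, dtype=bool)
    # k = 0: nothing partialled out, so use the raw correlations.
    avg_sq = [float(np.mean(corr[off_diag] ** 2))]

    for k in range(1, p - 1):
        a = loadings[:, :k]
        partial_cov = corr - a @ a.T             # residual after k components
        d = np.diag(partial_cov)
        if np.any(d <= 1e-12):                   # residual variance exhausted
            break
        partial_corr = partial_cov / np.sqrt(np.outer(d, d))
        avg_sq.append(float(np.mean(partial_corr[off_diag] ** 2)))

    return int(np.argmin(avg_sq)), avg_sq


# Hypothetical illustration: 12 variables driven by two latent factors.
rng = np.random.default_rng(0)
n, p = 2000, 12
factor_loadings = np.zeros((p, 2))
factor_loadings[:6, 0] = 0.8                     # variables 1-6 load on factor 1
factor_loadings[6:, 1] = 0.8                     # variables 7-12 load on factor 2
factors = rng.standard_normal((n, 2))
data = factors @ factor_loadings.T + 0.6 * rng.standard_normal((n, p))

corr = np.corrcoef(data, rowvar=False)
n_keep, avg_sq = velicer_map(corr)
print("components retained:", n_keep)
```

Because the criterion is a simple minimum over a deterministic sequence, it needs no reference distribution or significance threshold, in contrast to a hypothesis test on each eigenvalue.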

Figures

Figure 1
Genealogical representation of the coalescent simulations. (a) Two populations with a single divergence event 2tN_e generations ago. (b) Three populations with the first divergence event 2t_1N_e generations ago and the second divergence event 2t_2N_e generations ago.
Figure 2
Representative projections of simulated data for two populations. (a–c) The divergence event between populations A (red circles) and B (blue circles) occurred 0 generations ago. (d–f) The divergence event occurred 2N_e generations ago. (a, d) Analysis of populations A and B. (b, e) Analysis of admixed individuals (gray circles) with average individual admixture proportions 78.2% population A and 21.8% population B. (c, f) Combined analysis of admixed individuals, population A and population B.
Figure 3
Representative projections of simulated data for three populations. (a–c) The divergence event between populations B (blue circles) and C (black circles) occurred 0.0002N_e generations ago and the divergence of population A (red circles) occurred 0.002N_e generations ago. (d–f) The divergence event between populations B and C occurred 2N_e generations ago and the divergence of population A occurred 20N_e generations ago. (a, d) Analysis of populations A, B and C. (b, e) Analysis of admixed individuals (gray circles) with average individual admixture proportions 10% population A, 45% population B and 45% population C. (c, f) Combined analysis of admixed individuals, population A, population B and population C.
Figure 4
Top 16 principal components for the Howard University Family Study data using EIGENSOFT. All 16 principal components are statistically significant according to Tracy–Widom statistics. The bottom right panel shows the scree plot.
Figure 5
Top 16 principal components for the Howard University Family Study data using Velicer's minimum average partial test. Only the top principal component is statistically significant. The bottom right panel shows the scree plot.
