Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Oct;5(10):e1000686.
doi: 10.1371/journal.pgen.1000686. Epub 2009 Oct 16.

A genealogical interpretation of principal components analysis

Affiliations

A genealogical interpretation of principal components analysis

Gil McVean. PLoS Genet. 2009 Oct.

Abstract

Principal components analysis, PCA, is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's f(st) and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.

PubMed Disclaimer

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

Figure 1
Figure 1. Genealogical statistics.
The chart shows a genealogical tree describing the history of a sample of size five. Two samples, formula image and formula image, will share a derived mutation (indicated by the circle) if it occurs on the branch between their most recent common ancestor and the common ancestor of the whole sample. The length of this branch is formula image.
Figure 2
Figure 2. Principal component analysis of two populations.
(A) Consider a sample of formula image individuals from population A (indicated by the red circle) and formula image from population B (indicated by the blue circle), where the two populations have the same effective population size of formula image and are both derived from a single ancestral population, also of size formula image, with the split happening a time formula image in the past. (B) The expected locations of these two sets of samples on the first PC is defined by the time since divergence (the Euclidean distance between the samples is formula image) (see text for definitions) and the relative sample size from the populations, with the larger sample lying closer to the origin. Defining formula image, the relative location of the two populations on the first PC are formula image for samples from population A and formula image for samples from population B (note that the sign is arbitrary). (C) To investigate the effect of finite genome size simulations were carried out for the model shown in part A with 80 genomes sampled from population A, 20 from population B and a split time of 0.02 formula image generations (formula image) and between formula image and formula image SNPs. Lines indicate the analytical expectation. A jitter has been added to the x-axis for clarity. Note that the separation of samples with 10 SNPs does not correlate with population and simply reflects random clustering arising from the small numbers of SNPs.
Figure 3
Figure 3. The effect of uneven sampling on PCA projection.
PCA projection of samples taken from a set of nine populations arranged in a lattice, each of which exchanges migrants at rate formula image per formula image generations with each adjoining neighbour, leads to a recovery of the migration-space if samples are of equal size (A), or a distortion of migration-space if populations are not equally represented (B,C). In each part the left-hand panel shows the analytical solution (the area of each point represents the relative sample size) with migration routes illustrated while the right-hand panel shows the result of a simulation with a total sample size of 180 and 10,000 independent SNP loci. All examples are for formula image.
Figure 4
Figure 4. Identification of admixture proportions without source populations.
Initially an admixed population is formed by random mating from two populations, each fixed for a different allele at each locus with 40% contribution from one population. In the simulated population there are 1000 individuals, each of which has 20 chromosomes with 50 markers each, a genetic map length of 1 per chromosome and a uniform recombination rate. Subsequent generations are formed by random mating of the ancestral population. (A) Projections of 100 randomly chosen samples on the first PC over time show a decay in the fraction of variance explained by the first PC (note that the total variance in the population decays little over the time-scale of the simulation). (B) Admxiture proportions for the same individuals as in part A (blue points) as well as the everage heterozygosity (red line) and the fraction of the variance in PC1 explained by admixture proportions (black line). While there is a strong association between admixture proportion and location on PC1 for the first few generations, after 15 generations recombination has eliminated any signal, even though there is still strong admixture LD between nearby markers (data not shown).
Figure 5
Figure 5. Admixture proportions inferred from PCA projections.
(A) For each of the autosomes (chromosome 1 is the lowest) the points indicate the locations of sampled haplotypes (the transmitted and untransmitted haplotypes inferred from trios) on the first principal component (each chromosome is analysed separately; blue = CEU, orange = YRI, green = ASW). Importantly, PCA is carried out only on the haplotypes from CEU and YRI and all samples are subsequently projected onto the first PC identified from this analysis. Lines connect the transmitted (or untransmitted) haplotypes for each individual across chromosomes. Note the uniformity of the locations of samples on the first PC for CEU and YRI. Individual chromosomes within the ASW, however, show a great range of locations on the first PC. (B) The genome-wide admixture proportions (separately for transmitted and untransmitted chromosomes) can be inferred directly from the location of admixed samples on the first PC between the two source populations. Colours are as for (A). The vertical spacing of points is arbitrary.
Figure 6
Figure 6. The effect of SNP ascertainment on PCA projection.
(A) In the joint genealogy of the ascertainment (black circles) and genotyped samples (grey circles), only mutations occurring on the intersection of the two genealogies (shown in black) will be detected in both samples. For small discovery panels and large experimental samples, this may be considerably less than half the total genealogy length. (B) Model used to simulate data from three populations linked by two vicariance events, each of which is associated with a bottleneck; the model is an approximation to the demographic history of the HapMap populations ,. In the simulations 100 haploid genomes with 10,000 unlinked loci were sampled from each population and the parameters are formula image, formula image, formula image, formula image, where formula image is the bottleneck strength measured as the probability that two lineages entering the bottleneck have coalesced by its end (the bottleneck is instantaneous in real time). All populations have the same effective population size. (C) PCA of the simulated data (small open circles) shows strong agreement with results obtained from analytical consideration of the expected coalescence times (large circles). When only those SNPs that have been discovered in a small panel are considered (here modelled as 4, 8, and 4 additional samples from populations I, II, and III respectively) the principal effect is to scale the locations of the samples on the first two PCs (small filled circles) by a factor of approximately formula image (large diamonds).

Similar articles

Cited by

References

    1. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190. - DOI - PMC - PubMed
    1. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101. - PMC - PubMed
    1. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. - PubMed
    1. Cavalli-Sforza LL, Menozzi P, Piazza A. The History and Geography of Human Genes. New Jersey: Princeton; 1994.
    1. Reich D, Price AL, Patterson N. Principal component analysis of genetic data. Nat Genet. 2008;40:491–492. - PubMed

Publication types