PLoS Genet. 2012 Aug;8(8):e1002886.
doi: 10.1371/journal.pgen.1002886. Epub 2012 Aug 23.

A quantitative comparison of the similarity between genes and geography in worldwide human populations


Chaolong Wang et al. PLoS Genet. 2012 Aug.

Abstract

Multivariate statistical techniques such as principal components analysis (PCA) and multidimensional scaling (MDS) have been widely used to summarize the structure of human genetic variation, often in easily visualized two-dimensional maps. Many recent studies have reported similarity between geographic maps of population locations and MDS or PCA maps of genetic variation inferred from single-nucleotide polymorphisms (SNPs). However, this similarity has been evident primarily in a qualitative sense; and, because different multivariate techniques and marker sets have been used in different studies, it has not been possible to formally compare genetic variation datasets in terms of their levels of similarity with geography. In this study, using genome-wide SNP data from 128 populations worldwide, we perform a systematic analysis to quantitatively evaluate the similarity of genes and geography in different geographic regions. For each of a series of regions, we apply a Procrustes analysis approach to find an optimal transformation that maximizes the similarity between PCA maps of genetic variation and geographic maps of population locations. We consider examples in Europe, Sub-Saharan Africa, Asia, East Asia, and Central/South Asia, as well as in a worldwide sample, finding that significant similarity between genes and geography exists in general at different geographic levels. The similarity is highest in our examples for Asia and, once highly distinctive populations have been removed, Sub-Saharan Africa. Our results provide a quantitative assessment of the geographic structure of human genetic variation worldwide, supporting the view that geography plays a strong role in giving rise to human population structure.
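The Procrustes comparison described in the abstract can be illustrated with a minimal sketch. It assumes two-dimensional maps, a transformation made of translation, uniform scaling, rotation, and optional reflection, and a similarity score defined as sqrt(1 - D), where D is the normalized residual sum of squares after the best fit. The function names and the toy coordinates are hypothetical and are not taken from the paper. In two dimensions the best rotation and scaling have a closed form when points are written as complex numbers, so no linear algebra library is needed.

```python
import math


def _centered(points):
    """Convert (x, y) pairs to complex numbers with the centroid moved to 0."""
    zs = [complex(x, y) for x, y in points]
    mean = sum(zs) / len(zs)
    return [z - mean for z in zs]


def procrustes_similarity(geo, pcs):
    """Return sqrt(1 - D) for two equal-length lists of 2D points.

    Assumption: the fit allows translation, uniform scaling, rotation,
    and reflection. A value of 1 means the maps match exactly after the fit.
    """
    if len(geo) != len(pcs) or len(geo) < 2:
        raise ValueError("need two equal-length point lists with >= 2 points")
    z, w = _centered(geo), _centered(pcs)
    norm = math.sqrt(sum(abs(a) ** 2 for a in z) * sum(abs(b) ** 2 for b in w))
    if norm == 0.0:
        raise ValueError("all points coincide in one of the maps")
    # Best fit using a pure rotation (complex multiplication of w).
    rotation = abs(sum(a * b.conjugate() for a, b in zip(z, w)))
    # Best fit after mirroring w (complex conjugation), i.e. a reflection.
    reflection = abs(sum(a * b for a, b in zip(z, w)))
    # Clamp to 1.0 to absorb floating-point round-off.
    return min(1.0, max(rotation, reflection) / norm)


# Toy example: the "genetic" map is the "geographic" map rotated by 90 degrees,
# scaled by 3, and shifted, so the similarity should be essentially 1.
geo_demo = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
pcs_demo = [(-3.0 * y + 5.0, 3.0 * x - 2.0) for x, y in geo_demo]
demo_score = procrustes_similarity(geo_demo, pcs_demo)
```

Because the score is normalized by the spread of both maps, it does not depend on the units of either coordinate system, which is what makes it usable for comparing regions.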

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Procrustes analysis of genetic and geographic coordinates of worldwide populations.
(A) Geographic coordinates of 53 populations. (B) Procrustes-transformed PCA plot of genetic variation. The Procrustes analysis is based on the Gall-Peters projected coordinates of geographic locations and PC1-PC2 coordinates of 938 individuals. The figures are plotted according to the Gall-Peters projection. PC1 and PC2 are indicated by dotted lines, crossing over the centroid of all individuals. PC1 and PC2 account for 6.22% and 4.72% of the total variance, respectively. The Procrustes similarity is formula image (formula image). The rotation angle of the PCA map is formula image.
Figure 2. Procrustes analysis of genetic and geographic coordinates of European populations.
(A) Geographic coordinates of 37 populations. (B) Procrustes-transformed PCA plot of genetic variation. The Procrustes analysis is based on the unprojected latitude-longitude coordinates and PC1-PC2 coordinates of 1378 individuals. PC1 and PC2 are indicated by dotted lines, crossing over the centroid of all individuals. Abbreviations are as follows: AL, Albania; AT, Austria; BA, Bosnia-Herzegovina; BE, Belgium; BG, Bulgaria; CH-F, Swiss-French; CH-G, Swiss-German; CH-I, Swiss-Italian; CY, Cyprus; CZ, Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR, France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE, Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NL, Netherlands; NO, Norway; PL, Poland; PT, Portugal; RO, Romania; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; TR, Turkey; UA, Ukraine; YG, Serbia and Montenegro. Population labels follow the color scheme of Novembre et al. PC1 and PC2 account for 0.30% and 0.16% of the total variance, respectively. The Procrustes similarity is formula image (formula image). The rotation angle of the PCA map is formula image.
Figure 3. Procrustes analysis of genetic and geographic coordinates of Sub-Saharan African populations, excluding hunter-gatherer populations and Mbororo Fulani.
(A) Geographic coordinates of 23 populations. (B) Procrustes-transformed PCA plot of genetic variation. The Procrustes analysis is based on the unprojected latitude-longitude coordinates and PC1-PC2 coordinates of 348 individuals. PC1 and PC2 are indicated by dotted lines, crossing over the centroid of all individuals. PC1 and PC2 account for 1.34% and 0.69% of the total variance, respectively. The Procrustes similarity is formula image (formula image). The rotation angle of the PCA map is formula image.
Figure 4. Procrustes analysis of genetic and geographic coordinates of Asian populations.
(A) Geographic coordinates of 44 populations. (B) Procrustes-transformed PCA plot of genetic variation. The Procrustes analysis is based on the unprojected latitude-longitude coordinates and PC1-PC2 coordinates of 749 individuals. PC1 and PC2 are indicated by dotted lines, crossing over the centroid of all individuals. PC1 and PC2 account for 5.42% and 0.85% of the total variance, respectively. The Procrustes similarity is formula image (formula image). The rotation angle of the PCA map is formula image.
Figure 5. Procrustes analysis of genetic and geographic coordinates of East Asian populations.
(A) Geographic coordinates of 23 populations. (B) Procrustes-transformed PCA plot of genetic variation. The Procrustes analysis is based on the unprojected latitude-longitude coordinates and PC1-PC2 coordinates of 334 individuals. PC1 and PC2 are indicated by dotted lines, crossing over the centroid of all individuals. PC1 and PC2 account for 1.58% and 0.98% of the total variance, respectively. The Procrustes similarity statistic is formula image (formula image). The rotation angle of the PCA map is formula image.
Figure 6. Procrustes analysis of genetic and geographic coordinates of Central/South Asian populations.
(A) Geographic coordinates of 18 populations. (B) Procrustes-transformed PCA plot of genetic variation. The Procrustes analysis is based on the unprojected latitude-longitude coordinates and PC1-PC2 coordinates of 362 individuals. PC1 and PC2 are indicated by dotted lines, crossing over the centroid of all individuals. PC1 and PC2 account for 1.59% and 1.31% of the total variance, respectively. The Procrustes similarity statistic is formula image (formula image). The rotation angle of the PCA map is formula image.
Figure 7. Histograms of the Procrustes similarity of 100,000 permutations for analyses in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, and Figure 6.
The blue vertical lines indicate the value of formula image. (A) The worldwide dataset in Figure 1 (formula image, formula image). (B) The European dataset in Figure 2 (formula image, formula image). (C) The Sub-Saharan African dataset in Figure 3 (formula image, formula image). (D) The Asian dataset in Figure 4 (formula image, formula image). (E) The East Asian dataset in Figure 5 (formula image, formula image). (F) The Central/South Asian dataset in Figure 6 (formula image, formula image).
Figure 8. Procrustes analyses of genetic and geographic coordinates based on different numbers of loci.
The same sets of formula image randomly selected markers were used to generate PCA maps of genetic variation to compare with geographic maps for different regions. formula image.
Figure 9. Relationship between the Procrustes similarity and the proportion of genetic variation explained by the first two components of the PCA.
Both the main analyses of the paper in Table 2 and the supplementary analyses of Sub-Saharan Africa, in which certain populations excluded from the main analysis are included, are considered in obtaining the regression line. The values on the x-axis were obtained by summing the proportions of variance explained by PC1 and PC2 (columns 2 and 3 in Table 2, columns 6 and 7 in Table S7). formula image values were estimated from the same datasets as used in the PCA (column 7 in Table 2, column 11 in Table S7). The dashed line indicates the linear least squares fit of formula image. The Pearson correlation is formula image.
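The permutation histograms described for Figure 7 suggest a significance test of the following kind. This is a minimal sketch under stated assumptions: the pairing between geographic and genetic points is shuffled at random, the similarity is recomputed each time, and the p-value is the share of shuffles that reach the observed value. The similarity is taken to be sqrt(1 - D) for a fit allowing translation, scaling, rotation, and reflection. The function names, the default of 1,000 shuffles (the captions report 100,000), the row-wise shuffling scheme, and the add-one p-value convention are all illustrative choices, not details confirmed by the source.

```python
import math
import random


def similarity(a, b):
    """sqrt(1 - D) for two equal-length 2D point lists (closed form in 2D)."""
    za = [complex(x, y) for x, y in a]
    zb = [complex(x, y) for x, y in b]
    ma, mb = sum(za) / len(za), sum(zb) / len(zb)
    za = [p - ma for p in za]
    zb = [q - mb for q in zb]
    norm = math.sqrt(sum(abs(p) ** 2 for p in za) * sum(abs(q) ** 2 for q in zb))
    rot = abs(sum(p * q.conjugate() for p, q in zip(za, zb)))  # rotation fit
    ref = abs(sum(p * q for p, q in zip(za, zb)))              # reflected fit
    return min(1.0, max(rot, ref) / norm)


def permutation_p_value(geo, pcs, n_perm=1000, seed=0):
    """Return (observed similarity, permutation p-value).

    Assumption: geographic rows are shuffled relative to genetic rows, and
    the p-value uses the (hits + 1) / (n_perm + 1) convention.
    """
    rng = random.Random(seed)
    observed = similarity(geo, pcs)
    shuffled = list(geo)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the geography-to-genetics pairing
        if similarity(shuffled, pcs) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: the "genetic" map is an exact rotated and scaled copy of the
# "geographic" map, so random pairings should almost never match it.
geo_pts = [(0, 0), (1, 0), (3, 1), (4, 4), (2, 5), (0, 3), (6, 2), (5, 7)]
pcs_pts = [(-2 * y + 1, 2 * x - 3) for x, y in geo_pts]
obs, p_val = permutation_p_value(geo_pts, pcs_pts, n_perm=999, seed=1)
```

A histogram of the shuffled similarities, with the observed value marked, would correspond to one panel of the kind shown in Figure 7.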

