Am J Hum Genet. 2006 Apr;78(4):680-90.

doi: 10.1086/501531. Epub 2006 Feb 14.

Proportioning whole-genome single-nucleotide-polymorphism diversity for the identification of geographic population structure and genetic ancestry

Oscar Lao¹, Kate van Duijn, Paula Kersbergen, Peter de Knijff, Manfred Kayser

PMID: 16532397
PMCID: PMC1424693
DOI: 10.1086/501531

Oscar Lao et al. Am J Hum Genet. 2006 Apr.

Affiliation

¹ Department of Forensic Molecular Biology, Erasmus University Medical Centre Rotterdam, Rotterdam, The Netherlands.


Abstract

The identification of geographic population structure and genetic ancestry on the basis of a minimal set of genetic markers is desirable for a wide range of applications in medical and forensic sciences. However, the absence of sharp discontinuities in the neutral genetic diversity among human populations implies that, in practice, a large number of neutral markers will be required to identify the genetic ancestry of one individual. We showed that it is possible to reduce the number of markers required for detecting continental population structure to only 10 single-nucleotide polymorphisms (SNPs), by applying a newly developed ascertainment algorithm to Affymetrix GeneChip Mapping 10K SNP array data that we obtained from samples of globally dispersed human individuals (the Y Chromosome Consortium panel). Furthermore, this set of SNPs was able to recover the genetic ancestry of individuals from all four continents represented in the original data set when applied to an independent, much larger, worldwide population data set (Centre d'Etude du Polymorphisme Humain-Human Genome Diversity Project Cell Line Panel). Finally, we provide evidence that the unusual patterns of genetic variation we observed at the respective genomic regions surrounding the five most informative SNPs are in agreement with local positive selection being the explanation for the striking SNP allele-frequency differences we found between continental groups of human populations.

Figures

**Figure 1**
Percentage of information explained when the number of markers that are ascertained from 8,491 SNPs by use of the genetic algorithm based on the informativeness of assignment index (I_n) is increased from 1 to 10, given four continental groups and the YCC panel (see main text for details). The 95% CI of each SNP combination was computed by resampling the same number of chromosomes from the populations and computing I_n 1,000 times.
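The caption above relies on the informativeness-for-assignment index and a resampling-based confidence interval. As a rough illustration only, the sketch below computes I_n for a single SNP under the assumption that the paper uses the standard definition, I_n = Σ_j [ −p_j ln p_j + Σ_i (p_ij / K) ln p_ij ], where p_ij is the frequency of allele j in population i and p_j is its mean over the K populations. The function names, the input layout, and the binomial resampling scheme are hypothetical and are not taken from the paper; the genetic algorithm that selects SNP combinations is not shown.

```python
import math
import random


def xlogx(p):
    """Return p * ln(p), using the convention 0 * ln(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def informativeness(freqs):
    """I_n for one marker.

    freqs: one list per population, each holding the allele
    frequencies of that population (same allele order everywhere).
    """
    k = len(freqs)
    total = 0.0
    for j in range(len(freqs[0])):
        mean_p = sum(pop[j] for pop in freqs) / k
        total += -xlogx(mean_p) + sum(xlogx(pop[j]) for pop in freqs) / k
    return total


def bootstrap_ci(counts, n_boot=1000, seed=0):
    """Approximate 95% CI of I_n for a biallelic SNP.

    counts: (reference_allele_count, n_chromosomes) per population.
    Each replicate redraws the same number of chromosomes per
    population (an assumed resampling scheme, for illustration).
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        freqs = []
        for ref, n in counts:
            p = ref / n
            drawn = sum(1 for _ in range(n) if rng.random() < p)
            freqs.append([drawn / n, 1 - drawn / n])
        stats.append(informativeness(freqs))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Made-up allele counts for four continental groups.
example_counts = [(38, 40), (5, 40), (20, 40), (2, 40)]
low, high = bootstrap_ci(example_counts)
```

For two populations fixed for different alleles, `informativeness([[1.0, 0.0], [0.0, 1.0]])` equals ln 2, the maximum for K=2; identical frequencies in all populations give 0.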

**Figure 2**
STRUCTURE analysis of the YCC samples, with K=2, 3, or 4 groups, performed using genotypes of the 10 most informative SNPs ascertained using the genetic algorithm with the total YCC data. STRUCTURE analyses were computed using a model without admixture (A) and a model with admixture (B). Each analysis was repeated five times, after a Markov chain–Monte Carlo (MCMC) burn-in period of 50,000 and considering the next 200,000 MCMC iterations. In all five runs, good mixing was observed, and similar results were found in accordance with the model used. The natural logarithm of the estimated probability of the data (*lnp*) is as follows. In panel A, for K=2, *lnp*=-762.2; for K=3, *lnp*=-629.2; and, for K=4, *lnp*=-557.4. In panel B, for K=2, *lnp*=-764.9; for K=3, *lnp*=-631.2; and, for K=4, *lnp*=-559.5.

**Figure 3**
MDS plot based on the I_n matrix computed between pairs of populations by use of the genotypes of the 10 most informative SNPs in the 51 population samples from CEPH-HGDP. Four clusters of populations can be identified: (i) sub-Saharan African populations, (ii) American populations, (iii) Eastern Asian and Oceanian populations, and (iv) European, Middle Eastern, North African, and Central/South Asian populations.

**Figure 4**
STRUCTURE analysis of the CEPH-HGDP samples, with K=2, 3, 4, or 5 groups, performed using genotypes of the 10 most informative SNPs ascertained using the genetic algorithm with the total YCC data. Two different STRUCTURE analyses were computed: a population model without admixture (A) and a population model with admixture (B). Each analysis was repeated five times after an MCMC burn-in period of 100,000 and considering the next 10,000 MCMC iterations. In all five runs, good mixing was observed, and similar results were found in accordance with the model used. The *lnp*, assuming K groups, is as follows. In panel A, for K=2, *lnp*=-11,801.2; for K=3, *lnp*=-10,977.3; for K=4, *lnp*=-10,279.2; and, for K=5, *lnp*=-10,324.9. In panel B, for K=2, *lnp*=-11,886.2; for K=3, *lnp*=-11,070.6; for K=4, *lnp*=-10,345.5; and, for K=5, *lnp*=-10,456.9. Cen. Af. Rep. = Central African Republic; S. Afr. = South Africa.
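One common way to read *lnp* values like those in the caption above is the ad hoc rule suggested by the STRUCTURE authors: under a uniform prior on K, the posterior probability of each K is taken as proportional to exp(*lnp*). The sketch below applies that rule to the panel A values. It is an illustration only; the caption does not say how the number of groups was chosen, and the function name `posterior_over_k` is hypothetical.

```python
import math


def posterior_over_k(lnp_by_k):
    """Approximate Pr(K | data) from ln Pr(data | K), uniform prior on K.

    The maximum is subtracted before exponentiating so that the
    very negative log values do not all underflow to zero.
    """
    top = max(lnp_by_k.values())
    weights = {k: math.exp(v - top) for k, v in lnp_by_k.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


# lnp values reported for panel A (model without admixture).
panel_a = {2: -11801.2, 3: -10977.3, 4: -10279.2, 5: -10324.9}
post = posterior_over_k(panel_a)
best_k = max(post, key=post.get)
```

With these inputs `best_k` is 4, because the K=4 value exceeds the K=5 value by about 45.7 log units, which leaves essentially all of the approximate posterior mass on K=4.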

**Figure 5**
STRUCTURE analysis of each of the four groups detected in the HGDP-CEPH populations by previous STRUCTURE analysis (see main text) that considers models without admixture (A) and with admixture (B) and assumes K=2. A certain degree of population (sub)structure can be observed only in the case of American populations, but it disappears when three groups are considered (data not shown). Each analysis was repeated five times, after an MCMC burn-in period of 200,000 and considering the next 200,000 MCMC iterations. In all five runs, good mixing was observed, and similar results were found in accordance with the model used. The *lnp*, assuming K=2, is as follows. In panel A, for sub-Saharan Africa, *lnp*=-958.3; for America, *lnp*=-1,048.1; for East Asia and Oceania, *lnp*=-3,262.0; and, for Europe, the Middle East, Central/South Asia, and North Africa, *lnp*=-5,321.5. In panel B, for sub-Saharan Africa, *lnp*=-946.7; for America, *lnp*=-1,057.4; for East Asia and Oceania, *lnp*=-3,263.5; and, for Europe, the Middle East, Central/South Asia, and North Africa, *lnp*=-5,433.1.

**Figure 6**
BAPS 3.2 clustering results for K=2, 3, 4, and 5 groups in the HGDP-CEPH panel by use of the 10 most informative SNPs ascertained using the genetic algorithm with the YCC data. Each column represents an individual. The log (marginal likelihood) for K=2 groups is −11,687.5; for K=3, −10,832.6; for K=4, −10,164.8; and, for K=5, −10,024.32.

**Figure 7**
Sliding-window and haplotype analyses performed on the genomic region that includes SNP *rs952718* and the *ABCA12* gene. A, Sliding-window plot of the mean value observed for each window (the gene is represented by a black bar). B, Associated P value for comparison with an empirical distribution based on >10,000 genes (see main text). The P=.05 cutoff is represented by a black line. C, Bifurcation plots of the main core haplotypes in the three populations considered. D, Extended homozygosity versus genomic distance to the core haplotype. The region of the core haplotype was selected on the basis of the largest region that was statistically significant in the sliding-window analysis (from *rs6758257* to *rs6753310*; see main text for details).
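Panel D of this and the following figures plots extended haplotype homozygosity (EHH) against distance from the core haplotype. As a minimal sketch, assuming the usual definition (the probability that two randomly chosen chromosomes carrying the core haplotype are identical over the whole interval from the core to a given marker), EHH can be computed from phased haplotypes as below. The string encoding of haplotypes, the one-directional extension, and the function name `ehh_curve` are assumptions made for illustration, not the paper's implementation.

```python
from collections import Counter
from math import comb


def ehh_curve(haplotypes, core_start, core_end, core):
    """EHH of one core haplotype, extended marker by marker to the right.

    haplotypes: phased haplotypes as equal-length strings of alleles.
    core_start, core_end: slice bounds of the core region.
    core: allele string that defines the core haplotype of interest.
    """
    carriers = [h for h in haplotypes if h[core_start:core_end] == core]
    pairs = comb(len(carriers), 2)
    if pairs == 0:
        raise ValueError("need at least two chromosomes with the core")
    curve = []
    for end in range(core_end, len(haplotypes[0]) + 1):
        # Group carriers by their extended haplotype up to this marker.
        groups = Counter(h[core_start:end] for h in carriers)
        same = sum(comb(n, 2) for n in groups.values())
        curve.append(same / pairs)
    return curve


# Toy data: five chromosomes, four markers, core = first two markers.
toy = ["0110", "0110", "0111", "0100", "1010"]
curve = ehh_curve(toy, 0, 2, "01")
```

In the toy data, four chromosomes carry the core `01`; EHH starts at 1.0 at the core, drops to 0.5 after one additional marker, and to 1/6 after two. A haplotype under recent positive selection would be expected to show a slower decay than this.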

**Figure 8**
Sliding-window and haplotype analyses performed on the genomic region that includes SNP *rs722869* and the *VRK1* gene. A, Sliding-window plot of the mean value observed for each window (the gene is represented by a black bar). B, Associated P value for comparison with an empirical distribution based on >10,000 genes (see main text). The P=.05 cutoff is represented by a black line. C, Bifurcation plots of the main core haplotypes in the three populations considered. D, Extended homozygosity versus genomic distance to the core haplotype. The region of the core haplotype was selected on the basis of the largest region that was statistically significant in the sliding-window analysis (from *rs1957137* to *rs17191471*; see main text for details).

**Figure 9**
Sliding-window and haplotype analyses performed on the genomic region that includes SNP *rs1858465*. A, Sliding-window plot of the mean value observed for each window. B, Associated P value for comparison with an empirical distribution based on >10,000 genes (see main text). The P=.05 cutoff is represented by a black line. C, Bifurcation plots of the main core haplotypes in the three populations considered. D, Extended homozygosity versus genomic distance to the core haplotype. The region of the core haplotype was selected on the basis of the largest region that was statistically significant in the sliding-window analysis (from *rs2137476* to *rs1398515*; see main text for details).

**Figure 10**
Sliding-window and haplotype analyses performed on the genomic region that includes SNP *rs1344870*. A, Sliding-window plot of the mean value observed for each window. B, Associated P value for comparison with an empirical distribution based on >10,000 genes (see main text). The P=.05 cutoff is represented by a black line. C, Bifurcation plots of the main core haplotypes in the three populations considered. D, Extended homozygosity versus genomic distance to the core haplotype. The region of the core haplotype was selected on the basis of the largest region that was statistically significant in the sliding-window analysis (from *rs2335092* to *rs1898300*; see main text for details).

**Figure 11**
Sliding-window and haplotype analyses performed on the genomic region that includes SNP *rs1876482* (1 of the 10 most informative SNPs identified), which is located in the *LOC442008* gene, by use of Perlegen data. A, Sliding-window plot of the mean value observed for each window (the gene is represented by a black bar). B, Associated P value for comparison with an empirical distribution based on >10,000 genes (see main text). The P=.05 cutoff is represented by a black line. C, Bifurcation plots of the main core haplotypes in the three populations considered. D, Extended homozygosity versus genomic distance to the core haplotype. The region of the core haplotype was selected on the basis of the largest region that was statistically significant in the sliding-window analysis (from *rs12619554* to *rs4832712*; see main text for details). Note the high frequency of the third haplotype in the case of Asian populations and the slow decay of the EHH of that haplotype compared with the other haplotypes both within and between populations.
