PLoS One. 2012;7(1):e29901.
doi: 10.1371/journal.pone.0029901. Epub 2012 Jan 17.

Manifold learning for human population structure studies

Hoicheong Siu et al. PLoS One. 2012.

Abstract

The dimension of population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of LLE to population genetic analysis, we systematically investigate several important properties of LLE and reveal the connection between LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions that can be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates the genomic information content in a genomic region for collectively studying its association with population structure, and a LASSO algorithm to search for such regions across the genome. We applied the developed methodologies to a low-coverage pilot dataset from the 1000 Genomes Project and a HapMap Phase III Mexico dataset. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. These preliminary results demonstrate that next-generation sequencing offers a rich resource and that LLE provides a powerful tool for population structure analysis.
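To make the neighborhood-preserving idea concrete, the following is a minimal sketch of only the first stage of LLE: computing the weights that reconstruct each individual from its k nearest neighbors. It is not the authors' implementation. The function names, the regularization constant `reg`, the 0/1/2 genotype coding, and the use of plain Euclidean distance are all assumptions made for illustration; the second stage (the eigenvector embedding) is omitted.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lle_weights(points, k, reg=1e-3):
    """Return {i: {j: w_ij}}: weights reconstructing point i from its k nearest neighbors.

    Hypothetical helper: Euclidean distance and the `reg` value are assumptions.
    """
    weights = {}
    for i, xi in enumerate(points):
        def sq_dist(j):
            return sum((a - b) ** 2 for a, b in zip(xi, points[j]))

        neighbors = sorted((j for j in range(len(points)) if j != i), key=sq_dist)[:k]
        # Local Gram matrix of the differences between x_i and each neighbor.
        diffs = [[a - b for a, b in zip(xi, points[j])] for j in neighbors]
        gram = [[sum(p * q for p, q in zip(d1, d2)) for d2 in diffs] for d1 in diffs]
        # Regularize the diagonal so the system stays solvable when k exceeds the local rank.
        trace = sum(gram[j][j] for j in range(k))
        eps = reg * trace if trace > 0 else reg
        for j in range(k):
            gram[j][j] += eps
        w = solve_linear(gram, [1.0] * k)
        total = sum(w)
        weights[i] = {j: wj / total for j, wj in zip(neighbors, w)}  # weights sum to one
    return weights


# Toy data: six individuals, four SNPs, genotypes coded as minor-allele counts (assumed coding).
genotypes = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [2, 2, 2, 1],
    [2, 2, 2, 2],
    [0, 1, 0, 0],
    [2, 1, 2, 2],
]
weights = lle_weights(genotypes, k=2)
```

In this sketch, an edge between two individuals in a neighborhood graph would correspond to a nonzero entry in `weights`; each individual's weights sum to one by construction.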

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The reconstructed graphs of 179 human genomes from three populations (YRI, CEU and ASI) by the LLE algorithm, assuming nearest neighbors.
Each node in the graph represents an individual. A pair of nodes is connected if the weight between them is nonzero.
Figure 2. The reconstructed graph of the Mexican Americans in Los Angeles, California (MEX) population data from HapMap Phase III, released in May 2010, with 86 individuals from 33 families.
A father is represented by a large square, a mother by a large circle and a child by a small circle. The number inside each square or circle denotes the index of the family. Strong connections between individuals are represented by solid lines and weak connections by dotted lines.
Figure 3. A three-generation French family from the CEPH dataset reconstructed by the LLE, assuming k = 7, where a rectangle represents a male and a circle represents a female.
The numbers next to each edge are the weights that reconstruct each data point from its neighbors in the LLE. The red solid lines denote connections between the grandfathers and their children or between the fathers and their children. The blue solid lines denote connections between the grandmothers and their children or between the mothers and their children. The green lines denote connections between the grandchildren.
Figure 4. Low-dimensional coordinates in a nonlinear dimensional projection of all 14,397,437 SNPs of 179 individuals from three populations mapped by the LLE, where the X axis is the individual index and the Y axis is the eigenvalue, arranged from largest to smallest.
The color of the pixel at (X, Y) is the value of the eigenvector corresponding to the eigenvalue at Y for individual X. Green represents negative values, red represents positive values and white represents values close to zero.
Figure 5. Three eigenvectors in the eigenspace of zero eigenvalue for the nonlinear dimensional mapping of all 14,397,437 SNPs of 179 individuals from three populations mapped by the LLE define axes of populations: YRI, CEU and ASI.
Three intervals along the y-axis ranging from −0.15 to 0.15 were used to represent the values of the components in the eigenvectors (low dimensional representations) and the x-axis represents the individual index. The red color represents the YRI samples, green color represents the CEU samples and blue color represents the ASI samples.
Figure 6.
A. Three eigenvectors in the eigenspace of zero eigenvalue for the nonlinear dimensional mapping of all 14,397,437 SNPs of 179 individuals from four populations (YRI, CEU, CHB and JPT) mapped by the LLE using the allele frequency weighted genetic distance. The coordinates of the eigenvectors associated with individuals from YRI, CEU, and CHB and JPT (ASI) were mapped to the x axis, the y axis and the z axis, respectively. B. Graphic representation of the first two PCs for 179 individuals from four populations (YRI, CEU, CHB and JPT) on 14,397,437 SNPs.
Figure 7. Eight eigenvectors for the nonlinear dimensional mapping of 374,434 SNPs on the autosomes of 1,397 individuals from 11 populations by the LLE where the X axis represents an individual index in each population and the eight eigenvectors (low dimensional representations) were placed along the Y axis.
Figure 8. DNA variation pattern of the LLE-correlated, CEU-specific SNPs across populations.
A row with a star denotes rare alleles (MAF≤5%); a row without a star denotes common alleles (MAF>5%). (LLE) minimum or (PCA) minimum indicates that SNPs are selected by the smallest p-value; (LLE) Bonferroni or (PCA) Bonferroni indicates that SNPs are selected by p-values around the significance threshold after Bonferroni correction for multiple testing; (LLE) maximum or (PCA) maximum indicates that SNPs are selected by the highest p-value.
Figure 9. The distribution of structure-informative genomic regions identified by the LASSO, LLE and PCA for CEU samples.
The genome was divided into nonoverlapping 250 kb bins (x axis), and the y axis represents the P-value for testing whether the genomic region is significantly structure-informative.
Figure 10. Genotype frequency pattern of the SNPs within the significantly structure-informative genomic region on chromosome 2 between 32,500 kb and 32,750 kb, which was identified by the LLE and PCA methods but was non-significant by the LASSO method for CEU samples.
Red represents the homozygous genotype with the major allele, green represents the heterozygous genotype and blue represents the homozygous genotype with the minor allele.
Figure 11. Genotype frequency pattern of the SNPs within the significantly structure-informative genomic region on chromosome 3 between 164,500 kb and 164,750 kb, which was identified by the LASSO method but was non-significant by the LLE and PCA methods for CEU samples.
Red represents the homozygous genotype with the major allele, green represents the heterozygous genotype and blue represents the homozygous genotype with the minor allele.
Figure 12. Genotype frequency pattern of the SNPs within the significantly structure-informative genomic region on chromosome 1 between 27,750 kb and 28,000 kb, under negative selection, which was identified by all three methods (LASSO, LLE and PCA) for CEU samples.
Red represents the homozygous genotype with the major allele, green represents the heterozygous genotype and blue represents the homozygous genotype with the minor allele.
Figure 13. Genotype frequency pattern of the SNPs within the significantly structure-informative genomic region on chromosome 10 between 111,250 kb and 111,500 kb, under positive selection, which was identified by all three methods (LASSO, LLE and PCA) for CEU samples.
Red represents the homozygous genotype with the major allele, green represents the heterozygous genotype and blue represents the homozygous genotype with the minor allele.
Figure 14. LD pattern in the genomic region on chromosome 3 between 164,500 kb and 164,750 kb that is significantly structure-informative for CEU samples but not for YRI and ASI samples, and which was identified by the LASSO method but not by the LLE or PCA methods.
The LD levels were measured by a pairwise statistic [formula image] and illustrated by colors.
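
The captions above rely on three simple conventions: nonoverlapping 250 kb bins (Figure 9), a 5% MAF cutoff separating rare from common alleles (Figure 8), and a Bonferroni-corrected significance threshold (Figure 8). A minimal sketch of these conventions follows. The helper names, the zero-based base-pair coordinates, and the significance level `alpha` are assumptions for illustration; the captions do not state them.

```python
BIN_SIZE_BP = 250_000  # 250 kb bins, as in the Figure 9 caption
RARE_MAF_MAX = 0.05    # MAF <= 5% is labeled rare in the Figure 8 caption


def bin_index(position_bp, bin_size=BIN_SIZE_BP):
    """Index of the nonoverlapping bin containing a base-pair position (assumed zero-based)."""
    return position_bp // bin_size


def bin_range_kb(index, bin_size=BIN_SIZE_BP):
    """Start and end of a bin in kb, matching the 'between A kb and B kb' caption style."""
    start_kb = index * bin_size // 1000
    return start_kb, start_kb + bin_size // 1000


def is_rare(maf):
    """True when a minor allele frequency falls at or below the 5% cutoff."""
    return maf <= RARE_MAF_MAX


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction (alpha is an assumed input)."""
    return alpha / n_tests
```

For example, under these assumptions `bin_range_kb(bin_index(164_600_000))` gives the 164,500 kb to 164,750 kb interval named in the Figure 11 and Figure 14 captions.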
