PLoS One. 2011 Jan 31;6(1):e16227.
doi: 10.1371/journal.pone.0016227.

Predictions of native American population structure using linguistic covariates in a hidden regression framework

Flora Jay et al. PLoS One.

Abstract

Background: The mainland of the Americas is home to a remarkable diversity of languages, and the relationships between genes and languages have attracted considerable attention in the past. Here we investigate to what extent geography and languages can predict the genetic structure of Native American populations.

Methodology/principal findings: Our approach is based on a Bayesian latent cluster regression model in which cluster membership is explained by geographic and linguistic covariates. After correcting for geographic effects, we find that including linguistic information improves the prediction of individual membership in genetic clusters. We further compare the predictive power of Greenberg's and The Ethnologue classifications of Amerindian languages. We report that The Ethnologue classification provides a better genetic proxy than Greenberg's classification at both the stock and the group levels. Although high predictive values can be achieved with The Ethnologue classification, we nevertheless emphasize that the Choco, Chibchan and Tupi linguistic families do not exhibit a one-to-one correspondence with genetic clusters.
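As an illustration of how covariates can explain cluster membership, the following is a minimal sketch and not the authors' implementation. It assumes a multinomial logistic (softmax) link between covariates and membership probabilities, which the abstract does not specify; the function name, coefficient layout, and numeric values are all hypothetical.

```python
import math


def membership_probabilities(covariates, coefficients):
    """Return cluster-membership probabilities for one individual.

    covariates:   list of covariate values, e.g. [latitude, longitude]
    coefficients: one list [intercept, b1, b2, ...] per cluster
    (Hypothetical names; a multinomial logistic link is an assumption.)
    """
    # Linear predictor for each cluster: intercept + sum_j b_j * x_j
    scores = [
        beta[0] + sum(b * x for b, x in zip(beta[1:], covariates))
        for beta in coefficients
    ]
    # Softmax, shifted by the maximum score for numerical stability
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values only: two clusters, standardized latitude and longitude.
# Cluster 0 is the reference (all coefficients zero); cluster 1 depends on
# latitude but not on longitude.
coefficients = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
probs_north = membership_probabilities([1.0, 0.3], coefficients)
probs_south = membership_probabilities([-1.0, 0.3], coefficients)
```

In this toy setting, changing longitude leaves the probabilities unchanged because its coefficient is zero. In the Bayesian model the coefficients would be inferred jointly with the cluster assignments rather than fixed in advance as they are here.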

Conclusions/significance: The Bayesian latent class regression model described here is efficient at predicting population genetic structure using geographic and linguistic information in Native American populations.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Posterior distributions of the regression coefficients for a data set simulated with the hidden regression model.
The dashed vertical lines correspond to the regression coefficients used for generating the data. Two spatial covariates (latitude and longitude) are included in the regression model but only the first one influences genetic structure.
Figure 2. Misclassification rates for simulated data as a function of the covariates included in the clustering algorithm.
A. The cluster memberships are influenced by latitude but not by longitude. B. The data are generated using latitude and a 5-level linguistic classification. C. The data are generated in a five-island model for which formula image or formula image.
Figure 3. Variable selection for simulated data.
The correlation coefficients formula image correspond to the correlations between the estimated and predicted membership probabilities. Confidence intervals of the correlation coefficients are estimated by assuming that the Fisher's transform formula image follows a Gaussian distribution. The validation scores are estimated with the 2-fold cross-validation method. Their standard deviations are estimated by using a non-parametric bootstrap method. A. The cluster memberships are influenced by latitude but not by longitude. B. The data are generated using latitude and a 5-level linguistic classification. C. The data are generated in a five-island model for which formula image.
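The confidence-interval step mentioned in this caption can be sketched as follows. This is a minimal example, not the authors' code: it assumes the standard large-sample result that the Fisher transform z = atanh(r) is approximately Gaussian with standard error 1/sqrt(n - 3), a detail the caption does not state. The function name and the 95% critical value are illustrative choices.

```python
import math


def fisher_confidence_interval(r, n, z_crit=1.959964):
    """Approximate confidence interval for a correlation r from n pairs.

    z_crit is the Gaussian quantile (1.959964 gives roughly 95% coverage).
    Hypothetical helper; the 1/sqrt(n - 3) standard error is an assumption.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                # Fisher transform of the correlation
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    # Build the interval on the z scale, then map back to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = fisher_confidence_interval(0.8, 103)
```

Because the interval is built on the transformed scale and mapped back with tanh, its endpoints always stay inside (-1, 1), and it is asymmetric around r unless r is zero.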
Figure 4. Estimated and predicted population genetic structure for 28 Native American populations.
A. The membership coefficients are estimated in a model that includes spatial information (longitude, latitude). Inference of genetic structure is unchanged when we include additional linguistic covariates (Supporting Information Figure S1). The main differences between predictions obtained with or without linguistic information are framed in red. B-D. Membership coefficients predicted by Models B–D. The membership coefficients are averaged over individuals within the same linguistic unit.
Figure 5. Variable selection for the Native American HGDP data.
Geographic information includes longitude and latitude. Green. stands for Greenberg and Geog. stands for geography. The best model uses The Ethnologue linguistic classification.
Figure 6. Genetic structure of Native American populations as predicted by geographical covariates.
Geographical covariates include latitude, longitude, quadratic terms and an interaction term. Locations for which there is a cluster with a predicted membership coefficient larger than formula image are colored with the cluster color. Locations for which there is no cluster that reaches the formula image threshold or that are too distant from a sampled population are colored in grey. The barplot displays the membership probabilities as predicted by geographical covariates.
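The covariate expansion and the colouring rule described in this caption can be sketched as below. This is a hedged illustration only: the helper names are hypothetical, the threshold value is a caller-supplied assumption because it is not recoverable from this caption, and the distance-to-sample rule for grey locations is not modelled.

```python
def quadratic_features(latitude, longitude):
    """Expand (latitude, longitude) with quadratic and interaction terms."""
    return [
        latitude,
        longitude,
        latitude ** 2,          # quadratic term in latitude
        longitude ** 2,         # quadratic term in longitude
        latitude * longitude,   # interaction term
    ]


def dominant_cluster(probabilities, threshold):
    """Return the index of the cluster whose predicted membership exceeds
    the threshold, or None when no cluster does (a grey location)."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best if probabilities[best] > threshold else None


features = quadratic_features(2.0, -3.0)
coloured = dominant_cluster([0.1, 0.7, 0.2], threshold=0.5)
grey = dominant_cluster([0.4, 0.35, 0.25], threshold=0.5)
```

The expanded feature vector would feed a regression on cluster membership, and the second helper decides whether a map location takes a cluster colour or stays grey.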
