PLoS One. 2011 Jan 31;6(1):e16227.

doi: 10.1371/journal.pone.0016227.

Predictions of native American population structure using linguistic covariates in a hidden regression framework

Flora Jay¹, Olivier François, Michael G B Blum

Affiliation

¹ Laboratoire des Techniques de l'Ingénierie Médicale et de la Complexité, Equipe Biologie Computationnelle et Mathématique, Faculté de Médecine, Université Joseph Fourier, Grenoble, Centre National de la Recherche Scientifique, La Tronche, France.

PMID: 21305006
PMCID: PMC3031544
DOI: 10.1371/journal.pone.0016227

Flora Jay et al. PLoS One. 2011.

Abstract

Background: The mainland of the Americas is home to a remarkable diversity of languages, and the relationships between genes and languages have attracted considerable attention in the past. Here we investigate to what extent geography and languages can predict the genetic structure of Native American populations.

Methodology/principal findings: Our approach is based on a Bayesian latent cluster regression model in which cluster membership is explained by geographic and linguistic covariates. After correcting for geographic effects, we find that including linguistic information improves the prediction of individual membership in genetic clusters. We further compare the predictive power of Greenberg's and The Ethnologue classifications of Amerindian languages. We report that The Ethnologue classification provides a better genetic proxy than Greenberg's classification at both the stock and the group levels. Although high predictive values can be achieved with The Ethnologue classification, we nevertheless emphasize that the Choco, Chibchan and Tupi linguistic families do not exhibit a univocal correspondence with genetic clusters.
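To make the idea of covariates explaining cluster membership concrete, here is a minimal sketch. It assumes a multinomial-logit (softmax) link between covariates and membership probabilities; the abstract does not state the link function, so this is an illustration and not the authors' model. The function name, covariate layout, and coefficient values are all hypothetical.

```python
import math

def membership_probabilities(covariates, coefficients):
    """Return cluster membership probabilities for one individual.

    covariates   : list of floats, e.g. [1.0, latitude, language indicator]
    coefficients : one coefficient list per cluster (hypothetical values)
    Assumes a softmax link; the source abstract does not specify one.
    """
    scores = [sum(b * x for b, x in zip(beta, covariates)) for beta in coefficients]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical individual: intercept, standardized latitude, one language indicator
x = [1.0, 0.5, 1.0]
betas = [
    [0.0, 0.0, 0.0],    # reference cluster
    [0.2, 1.5, -0.4],
    [-0.3, -1.0, 2.0],
]
probs = membership_probabilities(x, betas)
# Index of the cluster with the highest predicted membership
predicted_cluster = max(range(len(probs)), key=probs.__getitem__)
```

In this toy setup, dropping the language indicator from `x` and `betas` would give the geography-only prediction, which is the kind of comparison the abstract describes.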

Conclusions/significance: The Bayesian latent class regression model described here is efficient at predicting population genetic structure using geographic and linguistic information in Native American populations.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Posterior distributions of the regression coefficients for a data set simulated with the hidden regression model.**
The dashed vertical lines correspond to the regression coefficients used for generating the data. Two spatial covariates (latitude and longitude) are included in the regression model but only the first one influences genetic structure.

**Figure 2. Misclassification rates for simulated data as a function of the covariates included in the clustering algorithm.**
A. The cluster memberships are influenced by latitude but not by longitude. B. The data are generated using latitude and a 5-level linguistic classification. C. The data are generated in a five-island model.

**Figure 3. Variable selection for simulated data.**
The correlation coefficients are the correlations between the estimated and predicted membership probabilities. Confidence intervals for the correlation coefficients are estimated by assuming that Fisher's transform follows a Gaussian distribution. The validation scores are estimated with the 2-fold cross-validation method. Their standard deviations are estimated with a non-parametric bootstrap. A. The cluster memberships are influenced by latitude but not by longitude. B. The data are generated using latitude and a 5-level linguistic classification. C. The data are generated in a five-island model.
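The confidence intervals mentioned in this caption can be sketched as follows. This is a minimal illustration that assumes the textbook form of Fisher's transform, with standard error 1/sqrt(n - 3) and a two-sided 95% level; the caption gives neither the sample size nor the level, so those choices and the function name are assumptions.

```python
import math

def fisher_confidence_interval(r, n, z_crit=1.959963984540054):
    """Approximate confidence interval for a correlation coefficient.

    r      : sample correlation, strictly between -1 and 1
    n      : number of paired observations (must exceed 3)
    z_crit : Gaussian quantile; the default gives a two-sided 95% interval
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                # Fisher's transform of r
    se = 1.0 / math.sqrt(n - 3)      # assumed Gaussian standard error on the z scale
    lower, upper = z - z_crit * se, z + z_crit * se
    return math.tanh(lower), math.tanh(upper)  # back-transform to the r scale

# Hypothetical inputs, not values from the paper
low, high = fisher_confidence_interval(0.8, 50)
```

Because the interval is built on the transformed scale and mapped back with `tanh`, it always stays inside (-1, 1) and is asymmetric around `r` when `r` is far from zero.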

**Figure 4. Estimated and predicted population genetic structure for 28 Native American populations.**
A. The membership coefficients are estimated in a model that includes spatial information (longitude, latitude). Inference of genetic structure is unchanged when we include additional linguistic covariates (Supporting Information Figure S1). The main differences between predictions obtained with or without linguistic information are framed in red. B–D. Membership coefficients predicted by Models B–D. The membership coefficients are averaged over individuals within the same linguistic unit.

**Figure 5. Variable selection for the Native American HGDP data.**
Geographic information includes longitude and latitude. "Green." stands for Greenberg and "Geog." stands for geography. The best model uses *The Ethnologue* linguistic classification.

**Figure 6. Genetic structure of Native American populations as predicted by geographical covariates.**
Geographical covariates include latitude, longitude, quadratic terms and an interaction term. Locations at which some cluster has a predicted membership coefficient larger than a fixed threshold are colored with that cluster's color. Locations at which no cluster reaches the threshold, or that are too distant from a sampled population, are colored in grey. The barplot displays the membership probabilities as predicted by geographical covariates.
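The covariate set and the coloring rule in this caption can be sketched as below. This is an illustration under stated assumptions: the intercept column, the function names, the color names, and the threshold of 0.6 are hypothetical (the threshold value is not given in the text), and the distance-to-sample rule is left out.

```python
def geographic_design_row(lat, lon):
    """Covariates for one location: linear, quadratic and interaction terms.

    The leading 1.0 is an intercept, which is an assumption of this sketch.
    """
    return [1.0, lat, lon, lat * lat, lon * lon, lat * lon]

def location_color(memberships, cluster_colors, threshold, grey="grey"):
    """Color a location by its dominant cluster if it exceeds the threshold.

    memberships    : predicted membership coefficients, one per cluster
    cluster_colors : color name for each cluster (hypothetical)
    threshold      : cutoff supplied by the caller; not specified in the source
    """
    best = max(range(len(memberships)), key=memberships.__getitem__)
    if memberships[best] > threshold:
        return cluster_colors[best]
    return grey  # no cluster is dominant enough at this location

# Hypothetical values for illustration only
row = geographic_design_row(2.0, 3.0)
colors = ["red", "blue", "green"]
clear_case = location_color([0.7, 0.2, 0.1], colors, threshold=0.6)
mixed_case = location_color([0.4, 0.35, 0.25], colors, threshold=0.6)
```

Keeping the threshold as a required argument avoids suggesting any particular cutoff as the one used for the figure.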
