Am J Phys Anthropol. 2015 Aug;157(4):630-40.
doi: 10.1002/ajpa.22758. Epub 2015 Jun 8.

Across language families: Genome diversity mirrors linguistic variation within Europe

Giuseppe Longobardi et al. Am J Phys Anthropol. 2015 Aug.

Abstract

Objectives: The notion that patterns of linguistic and biological variation may cast light on each other and on population histories dates back to Darwin's times; yet, turning this intuition into a proper research program has met with serious methodological difficulties, especially affecting language comparisons. This article takes advantage of two new tools of comparative linguistics: a refined list of Indo-European cognate words, and a novel method of language comparison estimating linguistic diversity from a universal inventory of grammatical polymorphisms, and hence enabling comparison even across different families. We corroborated the method and used it to compare patterns of linguistic and genomic variation in Europe.

Materials and methods: Two sets of linguistic distances, lexical and syntactic, were inferred from these data and compared with measures of geographic and genomic distance through a series of matrix correlation tests. Linguistic and genomic trees were also estimated and compared. TreeMix was used to infer migration episodes after the main population splits.
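
As an illustration of what a matrix correlation test involves, the sketch below implements a simple Mantel-style permutation test in plain Python. The abstract does not say which test variant or software the authors used, so this is an assumption, not their procedure; the function names (`upper_triangle`, `pearson`, `mantel`) and the default of 999 permutations are hypothetical choices made for the example.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel(d1, d2, permutations=999, seed=0):
    """Correlate two distance matrices and estimate a one-sided p-value.

    Rows and columns of d1 are permuted together, which keeps each
    population's distances intact while breaking its pairing with d2.
    """
    n = len(d1)
    rng = random.Random(seed)
    flat2 = upper_triangle(d2)
    observed = pearson(upper_triangle(d1), flat2)
    hits = 0
    for _ in range(permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        permuted = [[d1[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), flat2) >= observed:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    p_value = (hits + 1) / (permutations + 1)
    return observed, p_value
```

Calling `mantel(linguistic, genomic)` on two square distance matrices listed in the same population order returns the correlation and its permutation p-value. Whole rows and columns are permuted, rather than individual cells, because the entries of a distance matrix are not independent of one another.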

Results: We observed significant correlations between genomic and linguistic diversity, the latter inferred from data on both Indo-European and non-Indo-European languages. Contrary to previous observations, on the European scale, language proved a better predictor of genomic differences than geography. Inferred episodes of genetic admixture following the main population splits found convincing correlates also in the linguistic realm.

Discussion: These results pave the way for previously unfeasible cross-disciplinary analyses at the worldwide scale, encompassing populations of distant language families.

Keywords: genome-wide diversity; human evolutionary history; parametric comparison method; single-nucleotide polymorphisms.

Figures

Figure 1
Geographic distribution of the samples considered in this study. Indo‐European‐speaking populations in blue, populations speaking Finno‐Ugric languages (Hungarian, Finnish) and the linguistic isolate (Basque) in red.
Figure 2
UPGMA trees summarizing population relationships. Distances inferred from: (A) lexical and (B) syntactic comparisons among 12 Indo‐European‐speaking European populations; (C) syntactic comparisons among 15 European languages, and (D) F_ST distances among 15 populations sharing 177,949 SNPs. Lexical distances were estimated from lists of cognate words, amounting to over 6,000 roots (http://ielex.mpi.nl/); syntactic distances were measured over 56 parameters of nominal phrases (http://dx.doi.org/10.1075/jhl.3.1.07lon.additional). In (D), numbers indicate the support of the branching after 100 bootstrap replicates. The matrix perturbation techniques usable to test the robustness of trees (bootstrapping and jackknifing) provide stable topologies, but owing to the small number of characters involved they are only relatively reliable (cf. Longobardi et al., 2013 for more details). Therefore, bootstrapping scores have been only reported here for the genetic tree D.
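
For readers unfamiliar with UPGMA, the following minimal sketch shows how average-linkage clustering turns a distance matrix into a tree. It is a generic textbook version written for illustration, not the software used in the study; the function name `upgma` and the nested-tuple tree representation are hypothetical, and branch heights are not tracked.

```python
def upgma(labels, dist):
    """Build a nested-tuple tree from a symmetric distance matrix.

    labels: list of taxon names; dist: list of lists with dist[i][j]
    the distance between labels[i] and labels[j].
    """
    n = len(labels)
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is
        # the size-weighted mean of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every entry that involves the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        for c, value in new_d.items():
            d[(c, next_id)] = value  # next_id exceeds every existing id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree
```

Given a toy matrix in which A and B are close, C and D are close, and the two pairs are far apart, `upgma(["A", "B", "C", "D"], matrix)` groups A with B and C with D before joining the two pairs at the root.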
Figure 3
Projection on two dimensions of the main components (PCA) of linguistic (A) and individual genomic (B) variation. The linguistic PCA was performed using the R FactoMineR program, with neutralized parameter values coded as “NA,” whereas the genomic PCA was calculated with the R SNPRelate package (Lê et al., 2008). Note that the linguistic scatter diagram accounts for a fraction of the total variance that is >25‐fold as large as that accounted for by the genomic scatter diagram.
Figure 4
Unsupervised ancestry‐inference analysis based on the software ADMIXTURE. Each individual genotype is represented by a column in the area representing the appropriate population, and colors correspond to the fraction of the genotype that can be attributed to each of the K groups (2 ≤ K ≤ 5) assumed to have contributed to the populations' ancestry.
Figure 5
Maximum‐likelihood population trees. The algorithm chosen, TreeMix (28), estimates phylogenetic relationships with (A) three, (B) one, and (C) two superimposed migration events after the main population splits.
