Proc Biol Sci. 2010 Feb 7;277(1680):429-36.
doi: 10.1098/rspb.2009.1513. Epub 2009 Oct 7.

Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift

Florencia Reali et al.

Abstract

Scientists studying how languages change over time often make an analogy between biological and cultural evolution, with words or grammars behaving like traits subject to natural selection. Recent work has exploited this analogy by using models of biological evolution to explain the properties of languages and other cultural artefacts. However, the mechanisms of biological and cultural evolution are very different: biological traits are passed between generations by genes, while languages and concepts are transmitted through learning. Here we show that these different mechanisms can have the same results, demonstrating that the transmission of frequency distributions over variants of linguistic forms by Bayesian learners is equivalent to the Wright-Fisher model of genetic drift. This simple learning mechanism thus provides a justification for the use of models of genetic drift in studying language evolution. In addition to providing an explicit connection between biological and cultural evolution, this allows us to define a 'neutral' model that indicates how languages can change in the absence of selection at the level of linguistic variants. We demonstrate that this neutral model can account for three phenomena: the s-shaped curve of language change, the distribution of word frequencies, and the relationship between word frequencies and extinction rates.
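To make the claimed equivalence concrete, the following is a minimal sketch (ours, not the authors' code) of the iterated-learning chain for two competing variants: each learner combines the observed data with a symmetric Dirichlet prior (a Beta with parameters α/2 when K = 2), samples a hypothesis from the posterior, and produces the data seen by the next learner. The induced transition on the count of one variant is beta-binomial, the same transition kernel as a Wright-Fisher model with mutation; all parameter values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterated_learning(x1=5, N=10, alpha=0.2, iterations=1000):
    """Simulate one chain of iterated learning for K = 2 variants.

    Each learner sees x1 tokens of variant v1 out of N, forms the
    posterior Beta(x1 + alpha/2, N - x1 + alpha/2) over theta_1,
    samples a production probability from it, and emits N new tokens.
    The induced transition on x1 is beta-binomial, the same kernel
    as Wright-Fisher drift with symmetric mutation.
    """
    trajectory = [x1]
    for _ in range(iterations):
        theta1 = rng.beta(x1 + alpha / 2, (N - x1) + alpha / 2)
        x1 = rng.binomial(N, theta1)
        trajectory.append(x1)
    return trajectory

# Small alpha (alpha/2 < 1) drives the chain towards the extremes
# x1 = 0 or x1 = N, i.e. 'regularized' languages.
print(iterated_learning()[:10])
```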

Figures

Figure 1.
(a) Iterated learning: each learner sees data—i.e. utterances—produced by the previous learner, forms a hypothesis about the distribution from which the data were produced, and uses this hypothesis to produce the data that will be supplied to the next learner. (b) Prior distribution of θ1 for the case of two competing variants (K = 2), for values of (i) α/2 = 0.1, (ii) α/2 = 1, (iii) α/2 = 5. When α/2 = 1 the density function is simply a uniform distribution. When α/2 < 1 the prior is such that most of the probability mass is in the extremes of the distribution, favouring the ‘regularization’ of languages towards deterministic rules. When α/2 > 1, the learner tends to weight both variants equally, expecting languages to display probabilistic variation. (c) The effects of these expectations on the evolution of frequencies for values of α/2 indicated at the top of each column. Each panel shows changes in the probability distribution of one of the two variants (v1) (horizontal axis) over five iterations (vertical axis). The frequency of v1 was initialized at x1 = 5 from a total frequency of N = 10. White cells have zero probability, darker grey indicates higher probability.
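The shading in panel (c) can be reproduced exactly rather than by simulation: under posterior sampling, the one-step transition from frequency x1 is a beta-binomial distribution, so propagating a point mass at x1 = 5 through the transition matrix gives the probability of every frequency at every iteration. A sketch under those assumptions (the matrix construction is ours; parameter values follow the caption):

```python
import numpy as np
from scipy.stats import betabinom

def transition_matrix(N=10, alpha=0.2):
    """Exact one-step kernel of the iterated-learning chain:
    row x is the BetaBinomial(N, x + alpha/2, N - x + alpha/2) pmf."""
    T = np.zeros((N + 1, N + 1))
    for x in range(N + 1):
        T[x] = betabinom.pmf(np.arange(N + 1), N,
                             x + alpha / 2, (N - x) + alpha / 2)
    return T

# Distribution over x1 after each of five iterations, starting at x1 = 5:
T = transition_matrix(N=10, alpha=0.2)   # alpha/2 = 0.1, as in panel (c)(i)
p = np.zeros(11)
p[5] = 1.0
for t in range(5):
    p = p @ T
    print(t + 1, np.round(p, 3))  # mass accumulates at x1 = 0 and x1 = 10
```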
Figure 2.
Changes in the probability (vertical axis) of a new variant (v1) over 50 iterations of learning (horizontal axis) as a function of the value of α. The total frequency of v1 and v2 was N = 50, but the same effects are observed with larger values of N. (a) Changes in the probability of v1 using α = 0.05, corresponding to a prior that favours regularization: (i) probability changes when conditioning on the initial frequency only (x1 = 0); (ii) changes in the probability of v1 when conditioning on both the initial frequency (x1 = 0) and the final frequency (x1 = 50), corresponding to the situation in which the new variant (v1) eventually takes over the language. Under these conditions, s-shaped curves are observed, consistent with historical linguistic data. (b) Changes in the probability of v1 using α = 10, corresponding to a prior that favours probabilistic variation: (ii) the case of conditioning on the initial frequency only (x1 = 0); (i) the case of conditioning on both the initial (x1 = 0) and final (x1 = 50) frequencies, illustrating that the appearance of the s-shaped curve depends on the expectations of the learners. White cells have zero probability; darker grey indicates higher probability.
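The s-shaped curve in (a)(ii) is produced purely by conditioning a neutral chain on eventual takeover, which can be approximated by rejection sampling over simulated chains. A minimal sketch, assuming the same posterior-sampling chain as above; the number of simulated chains is our choice, and only a small fraction will satisfy the takeover condition at α = 0.05:

```python
import numpy as np

rng = np.random.default_rng(1)

def chain(N=50, alpha=0.05, iterations=50):
    """One iterated-learning chain; the new variant v1 starts absent."""
    x1, traj = 0, []
    for _ in range(iterations):
        theta1 = rng.beta(x1 + alpha / 2, (N - x1) + alpha / 2)
        x1 = rng.binomial(N, theta1)
        traj.append(x1 / N)
    return traj

# Keep only chains in which v1 has taken over by the final iteration,
# then average; the mean trajectory traces an s-shaped curve.
# (If no chain qualifies, increase the number of runs.)
runs = [chain() for _ in range(50000)]
taken_over = [t for t in runs if t[-1] == 1.0]
mean_curve = np.mean(taken_over, axis=0)
```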
Figure 3.
(a) Power-law distribution of word frequencies obtained from corpus data, consisting of N = 33 399 word tokens. The horizontal axis corresponds to word frequency (xk) on a log scale, and the vertical axis corresponds to the probability p(xk) that a certain word type falls within the bin at that frequency level. A power-law distribution is indicated by a linear relationship with slope γ = 1.70 (Bernstein-Ratner corpus). (b) Iterated learning using a two-parameter Poisson–Dirichlet distribution as a prior on distributions over infinitely many variants also produces a power-law relationship, with γ = 1.74. Simulations were implemented by sampling over a population of 33 399 arbitrarily assigned numerical word tokens to match the size of the corpus. Frequencies were initialized by setting all word tokens to the same unique type. The frequency distribution stabilized after 10 000 iterations of learning, and the result shown here reflects the distribution produced by a single learner after 20 000 iterations. We ran the simulations across a range of values of δ (from 0.1 to 1, in steps of 0.1), with α set to 10 (see the electronic supplementary material for details). Simulations with δ = 0.3 produced the closest match to the corpus data, and this is the case shown in the figure. (c) Initial lexical frequency xk plotted against the replacement rate, estimated as r = 1/t, where t is the number of iterations before absorption (i.e. xk = 0). For each frequency value, the time of absorption was measured directly over 5000 iterations after frequencies reached a steady state. The resulting linear relationship on a log–log plot reflects an underlying power law with γ = 0.8 (the correlation between log frequency and log replacement rate is r = −0.81, p < 0.00001).
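The infinite-variant simulation in (b) can be sketched with the Chinese-restaurant construction of the two-parameter Poisson–Dirichlet (Pitman-Yor) process: token counts generated this way follow a power law whose exponent is governed by the discount δ. The parameter values below mirror the caption (α = 10, δ = 0.3, 33 399 tokens), but the incremental sampler is a standard construction, not necessarily the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def pitman_yor_counts(n_tokens=33399, alpha=10.0, delta=0.3):
    """Sample word-type frequencies from a two-parameter
    Poisson-Dirichlet process via its Chinese-restaurant scheme.

    Each token joins existing type k with probability proportional to
    counts[k] - delta, or founds a new type with probability
    proportional to alpha + delta * K, where K is the current number
    of types. Runtime grows with the number of types (O(n * K)).
    """
    counts = []
    for _ in range(n_tokens):
        K = len(counts)
        weights = np.append(np.array(counts, dtype=float) - delta,
                            alpha + delta * K)
        k = rng.choice(K + 1, p=weights / weights.sum())
        if k == K:
            counts.append(1)       # a previously unseen word type
        else:
            counts[k] += 1         # reuse an existing word type
    return np.array(counts)

freqs = pitman_yor_counts()
# A binned log-log histogram of freqs is approximately linear,
# i.e. a power law; delta = 0.3 matched the corpus slope here.
```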
