Proc Biol Sci. 2010 Feb 7;277(1680):429-36.
doi: 10.1098/rspb.2009.1513. Epub 2009 Oct 7.

Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift

Florencia Reali et al.

Abstract

Scientists studying how languages change over time often make an analogy between biological and cultural evolution, with words or grammars behaving like traits subject to natural selection. Recent work has exploited this analogy by using models of biological evolution to explain the properties of languages and other cultural artefacts. However, the mechanisms of biological and cultural evolution are very different: biological traits are passed between generations by genes, while languages and concepts are transmitted through learning. Here we show that these different mechanisms can have the same results, demonstrating that the transmission of frequency distributions over variants of linguistic forms by Bayesian learners is equivalent to the Wright-Fisher model of genetic drift. This simple learning mechanism thus provides a justification for the use of models of genetic drift in studying language evolution. In addition to providing an explicit connection between biological and cultural evolution, this allows us to define a 'neutral' model that indicates how languages can change in the absence of selection at the level of linguistic variants. We demonstrate that this neutral model can account for three phenomena: the s-shaped curve of language change, the distribution of word frequencies, and the relationship between word frequencies and extinction rates.
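To make the claimed equivalence concrete, the following is a minimal sketch (ours, not the authors' code) of the iterated-learning chain for two competing variants: each learner combines the observed data with a symmetric Dirichlet prior (a Beta with parameters α/2 when K = 2), samples a hypothesis from the posterior, and produces the data seen by the next learner. The induced transition on the count of one variant is beta-binomial, the same transition kernel as a Wright-Fisher model with mutation; all parameter values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterated_learning(x1=5, N=10, alpha=0.2, iterations=1000):
    """Simulate one chain of iterated learning for K = 2 variants.

    Each learner sees x1 tokens of variant v1 out of N, forms the
    posterior Beta(x1 + alpha/2, N - x1 + alpha/2) over theta_1,
    samples a production probability from it, and emits N new tokens.
    The induced transition on x1 is beta-binomial, the same kernel
    as Wright-Fisher drift with symmetric mutation.
    """
    trajectory = [x1]
    for _ in range(iterations):
        theta1 = rng.beta(x1 + alpha / 2, (N - x1) + alpha / 2)
        x1 = rng.binomial(N, theta1)
        trajectory.append(x1)
    return trajectory

# Small alpha (alpha/2 < 1) drives the chain towards the extremes
# x1 = 0 or x1 = N, i.e. 'regularized' languages.
print(iterated_learning()[:10])
```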

Figures

Figure 1.
(a) Iterated learning: each learner sees data—i.e. utterances—produced by the previous learner, forms a hypothesis about the distribution from which the data were produced, and uses this hypothesis to produce the data that will be supplied to the next learner. (b) Prior distribution of θ1 for the case of two competing variants (K = 2), for values of (i) α/2 = 0.1, (ii) α/2 = 1, (iii) α/2 = 5. When α/2 = 1 the density function is simply a uniform distribution. When α/2 < 1 the prior is such that most of the probability mass is in the extremes of the distribution, favouring the ‘regularization’ of languages towards deterministic rules. When α/2 > 1, the learner tends to weight both variants equally, expecting languages to display probabilistic variation. (c) The effects of these expectations on the evolution of frequencies for values of α/2 indicated at the top of each column. Each panel shows changes in the probability distribution of one of the two variants (v1) (horizontal axis) over five iterations (vertical axis). The frequency of v1 was initialized at x1 = 5 from a total frequency of N = 10. White cells have zero probability, darker grey indicates higher probability.
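The shading in panel (c) can be reproduced exactly rather than by simulation: under posterior sampling, the one-step transition from frequency x1 is a beta-binomial distribution, so propagating a point mass at x1 = 5 through the transition matrix gives the probability of every frequency at every iteration. A sketch under those assumptions (the matrix construction is ours; parameter values follow the caption):

```python
import numpy as np
from scipy.stats import betabinom

def transition_matrix(N=10, alpha=0.2):
    """Exact one-step kernel of the iterated-learning chain:
    row x is the BetaBinomial(N, x + alpha/2, N - x + alpha/2) pmf."""
    T = np.zeros((N + 1, N + 1))
    for x in range(N + 1):
        T[x] = betabinom.pmf(np.arange(N + 1), N,
                             x + alpha / 2, (N - x) + alpha / 2)
    return T

# Distribution over x1 after each of five iterations, starting at x1 = 5:
T = transition_matrix(N=10, alpha=0.2)   # alpha/2 = 0.1, as in panel (c)(i)
p = np.zeros(11)
p[5] = 1.0
for t in range(5):
    p = p @ T
    print(t + 1, np.round(p, 3))  # mass accumulates at x1 = 0 and x1 = 10
```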
Figure 2.
Changes in the probability (vertical axis) of a new variant (v1) over 50 iterations of learning (horizontal axis) as a function of the value of α. The total frequency of v1 and v2 was N = 50, but the same effects are observed with larger values of N. (a) Changes in the probability of v1 using α = 0.05, corresponding to a prior that favours regularization: (i) probability changes when conditioning on the initial frequency only (x1 = 0); (ii) changes in the probability of v1 when conditioning on both the initial frequency (x1 = 0) and the final frequency (x1 = 50), corresponding to the situation in which the new variant (v1) eventually takes over the language. Under these conditions, s-shaped curves are observed, consistent with historical linguistic data. (b) Changes in the probability of v1 using α = 10, corresponding to a prior that favours probabilistic variation: (ii) the case of conditioning on the initial frequency only (x1 = 0); (i) the case of conditioning on both the initial (x1 = 0) and final (x1 = 50) frequencies, illustrating that the appearance of the s-shaped curve depends on the expectations of the learners. White cells have zero probability; darker grey indicates higher probability.
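The s-shaped curve in (a)(ii) is produced purely by conditioning a neutral chain on eventual takeover, which can be approximated by rejection sampling over simulated chains. A minimal sketch, assuming the same posterior-sampling chain as above; the number of simulated chains is our choice, and only a small fraction will satisfy the takeover condition at α = 0.05:

```python
import numpy as np

rng = np.random.default_rng(1)

def chain(N=50, alpha=0.05, iterations=50):
    """One iterated-learning chain; the new variant v1 starts absent."""
    x1, traj = 0, []
    for _ in range(iterations):
        theta1 = rng.beta(x1 + alpha / 2, (N - x1) + alpha / 2)
        x1 = rng.binomial(N, theta1)
        traj.append(x1 / N)
    return traj

# Keep only chains in which v1 has taken over by the final iteration,
# then average; the mean trajectory traces an s-shaped curve.
# (If no chain qualifies, increase the number of runs.)
runs = [chain() for _ in range(50000)]
taken_over = [t for t in runs if t[-1] == 1.0]
mean_curve = np.mean(taken_over, axis=0)
```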
Figure 3.
(a) Power-law distribution of word frequencies obtained from corpus data, consisting of N = 33 399 word tokens. The horizontal axis corresponds to word frequency (xk) on a log scale, and the vertical axis corresponds to the probability p(xk) that a certain word type falls within the bin at that frequency level. A power-law distribution is indicated by a linear relationship with slope γ = 1.70 (Bernstein-Ratner corpus). (b) Iterated learning using a two-parameter Poisson–Dirichlet distribution as a prior on distributions over infinitely many variants also produces a power-law relationship, with γ = 1.74. Simulations were implemented by sampling over a population of 33 399 arbitrarily assigned numerical word tokens to match the size of the corpus. Frequencies were initialized by setting all word tokens to the same unique type. The frequency distribution stabilized after 10 000 iterations of learning, and the result shown here reflects the distribution produced by a single learner after 20 000 iterations. We ran the simulations across a range of values of δ (from 0.1 to 1, in steps of 0.1), with α set to 10 (see the electronic supplementary material for details). Simulations with δ = 0.3 produced the closest match to the corpus data, and this is the case shown in the figure. (c) Initial lexical frequency xk plotted against the replacement rate, estimated as r = 1/t, where t is the number of iterations before absorption (i.e. xk = 0). For each frequency value, the time of absorption was measured directly over 5000 iterations after frequencies reached a steady state. The resulting linear relationship on a log–log plot reflects an underlying power law with γ = 0.8 (the correlation between log frequency and log replacement rate is r = −0.81, p < 0.00001).
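The infinite-variant simulation in (b) can be sketched with the Chinese-restaurant construction of the two-parameter Poisson–Dirichlet (Pitman-Yor) process: token counts generated this way follow a power law whose exponent is governed by the discount δ. The parameter values below mirror the caption (α = 10, δ = 0.3, 33 399 tokens), but the incremental sampler is a standard construction, not necessarily the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def pitman_yor_counts(n_tokens=33399, alpha=10.0, delta=0.3):
    """Sample word-type frequencies from a two-parameter
    Poisson-Dirichlet process via its Chinese-restaurant scheme.

    Each token joins existing type k with probability proportional to
    counts[k] - delta, or founds a new type with probability
    proportional to alpha + delta * K, where K is the current number
    of types. Runtime grows with the number of types (O(n * K)).
    """
    counts = []
    for _ in range(n_tokens):
        K = len(counts)
        weights = np.append(np.array(counts, dtype=float) - delta,
                            alpha + delta * K)
        k = rng.choice(K + 1, p=weights / weights.sum())
        if k == K:
            counts.append(1)       # a previously unseen word type
        else:
            counts[k] += 1         # reuse an existing word type
    return np.array(counts)

freqs = pitman_yor_counts()
# A binned log-log histogram of freqs is approximately linear,
# i.e. a power law; delta = 0.3 matched the corpus slope here.
```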
