J R Soc Interface. 2012 Dec 7;9(77):3323-8.
doi: 10.1098/rsif.2012.0491. Epub 2012 Jul 25.

Evolution of the most common English words and phrases over the centuries

Matjaz Perc. J R Soc Interface.

Abstract

By determining the most common English words and phrases since the beginning of the sixteenth century, we obtain a unique large-scale view of the evolution of written text. We find that the most common words and phrases in any given year had a much shorter popularity lifespan in the sixteenth century than they had in the twentieth century. By measuring how their usage propagated across the years, we show that for the past two centuries, the process has been governed by linear preferential attachment. Along with the steady growth of the English lexicon, this provides an empirical explanation for the ubiquity of Zipf's law in language statistics and confirms that writing, although undoubtedly an expression of art and skill, is not immune to the same influences of self-organization that are known to regulate processes as diverse as the making of new friends and World Wide Web growth.
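The link between growth, linear preferential attachment and Zipf's law can be illustrated with a toy simulation. The sketch below is not the analysis performed in the paper; it is a Simon-style model chosen here for illustration. It assumes that each new token is a brand-new word with probability `p_new` and is otherwise a copy of a uniformly chosen earlier token, so that a word is reused in proportion to its current count. The function names, the value of `p_new` and the rank cut-off are all hypothetical.

```python
import math
import random
from collections import Counter


def simulate_text(n_tokens, p_new=0.05, seed=0):
    """Generate word ids by lexicon growth plus linear preferential attachment."""
    rng = random.Random(seed)
    tokens = [0]      # the text starts with a single word (id 0)
    next_word = 1
    while len(tokens) < n_tokens:
        if rng.random() < p_new:
            # Growth: a previously unseen word enters the lexicon.
            tokens.append(next_word)
            next_word += 1
        else:
            # Copying a random earlier token selects each word with
            # probability proportional to its current frequency.
            tokens.append(rng.choice(tokens))
    return tokens


def rank_frequency_slope(tokens, max_rank=100):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct words among the top ranks.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)[:max_rank]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    text = simulate_text(20000, p_new=0.05, seed=1)
    print("log-log slope of the rank-frequency curve:", rank_frequency_slope(text))
```

For small `p_new` the fitted slope is expected to lie in the vicinity of −1, but the exact value depends on the seed, the text length and the range of ranks included in the fit.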

Figures

Figure 1.
Confirmation of Zipf's law in the examined corpus. By measuring the frequency of 1-grams in the n-grams, where n > 2 (refer to key), we find that it is inversely proportional to the rank of the 1-grams. For all n, the depicted curves decay with a slope of −1 on a double log scale over several orders of magnitude, thus confirming the validity of Zipf's law in the examined dataset.
Figure 2.
Evolution of popularity of the top 100 n-grams over the past five centuries. For each of the five starting years (1520, 1600, 1700, 1800 and 1900, from left to right, separated by dashed grey lines), the rank of the top 100 n-grams was followed until it exceeded 10 000 or until the end of the century. From top to bottom, the panels depict results for different n, as indicated vertically. The advent of the nineteenth century marks a turning point after which the rankings became markedly more consistent. Regardless of which century is considered, the higher the n, the more fleeting the popularity. Tables listing the top n-grams for all available years are available at http://www.matjazperc.com/ngrams.
Figure 3.
‘Statistical’ coming of age of the English language. Symbols depict results for different n (refer to key), as obtained by calculating the average standard deviation of the rank for the top 1000 n-grams 100 years into the future. The thick grey line is a moving average over all the n-grams and over the analysis going 50 and 100 years into the future as well as backwards. There is a sharp transition to a greater maturity of the rankings taking place at around the year 1800. Although the moving average softens the transition, it confirms that the ‘statistical’ coming of age was taking place and that the nineteenth century was crucial in this respect.
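The quantity plotted in Figure 3, the average standard deviation of the rank of the top n-grams over a window of following years, can be sketched as below. This is a minimal illustration, not the author's code. It assumes a hypothetical input `yearly_counts` that maps each year to a dictionary of n-gram counts, it breaks ties in counts alphabetically, and it skips years in which an n-gram does not occur; the caption does not state how the original analysis handled those cases.

```python
from statistics import mean, pstdev


def ranks(counts):
    """Map each n-gram to its rank (1 = most frequent); ties broken alphabetically."""
    ordered = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: position + 1 for position, gram in enumerate(ordered)}


def mean_rank_std(yearly_counts, start_year, horizon, top_k):
    """Average, over the top_k n-grams of start_year, of the standard
    deviation of their rank from start_year to start_year + horizon."""
    start_ranks = ranks(yearly_counts[start_year])
    top = [gram for gram, rank in start_ranks.items() if rank <= top_k]
    years = [year for year in range(start_year, start_year + horizon + 1)
             if year in yearly_counts]
    per_year = {year: ranks(yearly_counts[year]) for year in years}
    deviations = []
    for gram in top:
        # Assumption: years in which the n-gram is absent are skipped.
        series = [per_year[year][gram] for year in years if gram in per_year[year]]
        deviations.append(pstdev(series))
    return mean(deviations)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    toy = {
        1900: {"the": 10, "of": 8, "and": 6},
        1901: {"the": 5, "of": 9, "and": 7},
        1902: {"the": 10, "of": 8, "and": 6},
    }
    print(mean_rank_std(toy, start_year=1900, horizon=2, top_k=2))
```

A value of zero means the top n-grams kept exactly the same ranks throughout the window; larger values indicate more volatile rankings.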
Figure 4.
Emergence of linear preferential attachment during the past two centuries. Based on the preceding evolution of popularity, two time periods were considered separately, as indicated in the figure legend. While preferential attachment appears to have been in place already during the 1520–1800 period, large deviations from the linear dependence (the goodness-of-fit is ≈0.05) hint towards inconsistencies that may have resulted in heavily fluctuating rankings. The same analysis for the nineteenth and the twentieth centuries provides much more conclusive results. For all n the data fall nicely onto straight lines (the goodness-of-fit is ≈0.8), thus indicating that continuous growth and linear preferential attachment have shaped the large-scale organization of the writing of English books over the past two centuries. Results for those n-grams that are not depicted are qualitatively identical for both periods of time.
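One simple way to probe for linear preferential attachment is to regress the later gain in usage of each n-gram on its earlier usage and check how well a straight line fits. The sketch below illustrates that idea only. It assumes two hypothetical dictionaries of counts, `earlier` and `later`, and it uses the coefficient of determination R² as the fit-quality measure; the caption does not specify which goodness-of-fit statistic the paper reports, so R² here is an assumption.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys).

    Returns (slope, intercept, r_squared). Requires that neither xs nor ys
    is constant, otherwise a division by zero occurs.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def attachment_fit(earlier, later):
    """Fit gain = later - earlier against the earlier count, per n-gram."""
    grams = sorted(set(earlier) & set(later))
    xs = [earlier[gram] for gram in grams]
    ys = [later[gram] - earlier[gram] for gram in grams]
    return linear_fit(xs, ys)


if __name__ == "__main__":
    # Made-up counts in which every n-gram gains 20% of its earlier usage.
    earlier = {"the": 100, "of": 50, "and": 20, "to": 10}
    later = {"the": 120, "of": 60, "and": 24, "to": 12}
    slope, intercept, r_squared = attachment_fit(earlier, later)
    print(slope, intercept, r_squared)
```

Under strictly linear preferential attachment the gains are proportional to the earlier counts, so the points fall on a straight line and R² is close to 1. Strong scatter around the line, as the caption reports for 1520–1800, lowers the fit quality.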
