PLoS One. 2023 May 18;18(5):e0285630.
doi: 10.1371/journal.pone.0285630. eCollection 2023.

On the fractal patterns of language structures

Leonardo Costa Ribeiro et al. PLoS One. 2023.

Abstract

Natural Language Processing (NLP) uses Artificial Intelligence algorithms to extract meaningful information from unstructured texts, i.e., content that lacks metadata and cannot easily be indexed or mapped onto standard database fields. Its applications range from sentiment analysis and text summarization to automatic language translation. In this work, we use NLP to identify similar structural linguistic patterns across several languages. We apply the word2vec algorithm, which creates a vector representation of words in a multidimensional space that preserves the meaning relationships between them. From a large corpus, we built this vector representation in a 100-dimensional space for English, Portuguese, German, Spanish, Russian, French, Chinese, Japanese, Korean, Italian, Arabic, Hebrew, Basque, Dutch, Swedish, Finnish, and Estonian. We then calculated the fractal dimensions of the structure that represents each language. The structures are multi-fractals with two different dimensions, which we use together with each language's token-dictionary size rate to represent the languages in a three-dimensional space. Finally, analyzing the distances among languages in this space, we conclude that closeness there tends to be related to distance in the phylogenetic tree that depicts the lines of evolutionary descent of the languages from a common ancestor.
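
The box-counting estimate of a fractal dimension mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `box_count` and `box_counting_dimension`, the choice of box sizes, and the toy point sets are all assumptions made for illustration. The idea is to cover the point set with boxes of side eps, count the occupied boxes N(eps), and take the slope of log N(eps) against log(1/eps). In the setting of the paper the points would be 100-dimensional word vectors, and fitting the slope separately over longer and shorter scales would be one way to obtain two dimensions.

```python
import math


def box_count(points, eps):
    """Count the distinct boxes of side eps that contain at least one point."""
    # Each point is mapped to the integer index of its box along every axis.
    return len({tuple(math.floor(x / eps) for x in p) for p in points})


def box_counting_dimension(points, box_sizes):
    """Least-squares slope of log N(eps) against log(1/eps)."""
    xs = [math.log(1.0 / eps) for eps in box_sizes]
    ys = [math.log(box_count(points, eps)) for eps in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy data (hypothetical, not from the paper): a straight segment embedded
# in 3-D and a filled square in 2-D, both sampled on a regular grid.
sizes = [1 / 4, 1 / 8, 1 / 16, 1 / 32]
line = [(i / 4096, i / 4096, 0.0) for i in range(4096)]
square = [(i / 64, j / 64) for i in range(64) for j in range(64)]

print(box_counting_dimension(line, sizes))    # close to 1
print(box_counting_dimension(square, sizes))  # close to 2
```

Storing occupied boxes in a set keeps the cost proportional to the number of points rather than to the number of possible boxes, which matters in a high-dimensional embedding space where almost all boxes are empty.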

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Integer-dimension objects.
Fig 2. Fractional-dimension object: Koch curve.
Fig 3. Word2vec neural network.
Fig 4. Vector representation of words by word2vec.
Fig 5. Box counting.
Fig 6. Calculation of the fractal-dimension by the box-counting algorithm.
Fig 7. Representation of the languages in a bi-dimensional space with the longer-scale fractal dimension.
Fig 8. Representation of the languages in a bi-dimensional space with the shorter-scale fractal dimension.
Fig 9. Top-50 biggest clusters of Russian.
Fig 10. Top-50 biggest clusters of Hebrew.
Fig 11. Cluster size ranking.

