On the fractal patterns of language structures
- PMID: 37200318
- PMCID: PMC10194960
- DOI: 10.1371/journal.pone.0285630
Abstract
Natural Language Processing (NLP) uses Artificial Intelligence algorithms to extract meaningful information from unstructured texts, i.e., content that lacks metadata and cannot easily be indexed or mapped onto standard database fields. Its applications range from sentiment analysis and text summarization to automatic language translation. In this work, we use NLP to identify similar structural linguistic patterns across several languages. We apply the word2vec algorithm, which creates a vector representation of words in a multidimensional space that preserves the semantic relationships between them. From a large corpus, we built this vector representation in a 100-dimensional space for English, Portuguese, German, Spanish, Russian, French, Chinese, Japanese, Korean, Italian, Arabic, Hebrew, Basque, Dutch, Swedish, Finnish, and Estonian. We then calculated the fractal dimensions of the structure that represents each language. The structures are multifractals with two different dimensions, which we use, together with the token-dictionary size rate of each language, to represent the languages in a three-dimensional space. Finally, by analyzing the distances among languages in this space, we conclude that closeness there tends to be related to distance in the phylogenetic tree that depicts the lines of evolutionary descent of the languages from a common ancestor.
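As an illustration of the kind of computation the abstract describes, the following is a minimal sketch of estimating a fractal dimension of a point cloud, such as a set of word vectors. The abstract does not state which estimator the authors used, so box counting is an assumption made here for illustration only. The names `box_count`, `box_counting_dimension`, and `euclidean` are hypothetical, and the sketch uses only the Python standard library.

```python
import math


def box_count(points, eps):
    """Count the distinct grid cells of side eps occupied by the points."""
    return len({tuple(math.floor(x / eps) for x in p) for p in points})


def box_counting_dimension(points, scales):
    """Estimate a fractal dimension as the least-squares slope of
    log N(eps) against log(1 / eps), where N(eps) is the box count.

    Assumption: box counting stands in for whatever estimator the
    paper actually used; the abstract does not specify it.
    """
    xs = [math.log(1.0 / eps) for eps in scales]
    ys = [math.log(box_count(points, eps)) for eps in scales]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def euclidean(a, b):
    """Euclidean distance between two points, e.g. two languages placed
    in a three-dimensional space of (dimension 1, dimension 2, rate)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Sanity check on a known shape: points along a straight line
    # embedded in 3D should give a dimension close to 1.
    line = [(i / 10000, i / 10000, i / 10000) for i in range(10000)]
    print(box_counting_dimension(line, [0.5, 0.25, 0.125, 0.0625]))
```

For real embeddings in 100 dimensions, box counting becomes sparse very quickly, so the range of scales has to be chosen with care and the slope read only over the region where the log-log plot is approximately linear. A multifractal structure with two dimensions, as reported in the abstract, would show up as two such linear regions with different slopes.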
Copyright: © 2023 Ribeiro et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Conflict of interest statement
The authors have declared that no competing interests exist.