Lang Resour Eval. 2023 Mar 2:1-31.
doi: 10.1007/s10579-023-09640-9. Online ahead of print.

Regionalized models for Spanish language variations based on Twitter

Eric S Tellez et al. Lang Resour Eval. 2023.

Abstract

Spanish is one of the most widely spoken languages in the world. Its spread comes with variations in written and spoken communication among different regions. Understanding these variations can help improve model performance on regional tasks, such as those involving figurative language and local context information. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, along with examples of using the regional resources on message classification tasks.

Keywords: Linguistic resources; Semantic space; Spanish Twitter.

Figures

Fig. 1
The vocabulary growth and distribution of frequencies of 10^7 tokens over a sample of our Spanish-language Twitter corpora
Fig. 2
Distribution of tweets and tweeters labeled as Spanish-speaking users around the world. Colors encode the logarithmic frequencies in data collected from 2016 to 2019 with the public Twitter API stream. Darker colors indicate a larger population; because the scale is logarithmic, only large frequency differences produce visible color changes
Fig. 3
Ratio between the number of tweets produced by local tweeters and the total number of tweets in the country
Fig. 4
Affinity matrix among Spanish regions’ vocabularies
Fig. 5
Visualization of Spanish-language lexical similarity among countries' vocabularies through a two-dimensional UMAP projection computed with the cosine similarity between vocabularies. The points were colored using a 3D UMAP projection (normalized and interpreted as RGB). Both projections use three nearest neighbors, which emphasizes local features
Fig. 6
Regional Vocabulary in RGB representation
Fig. 7
Most popular emojis per Spanish-speaking country
Fig. 8
Number of common tokens shared by different countries or regions
Fig. 9
Semantic similarities of our Spanish regional word embeddings. Countries are identified by their two-letter ISO codes. On the left, an affinity matrix where darker cells indicate higher similarities (small distances). On the right, a two-dimensional UMAP projection in which nearby points indicate similarity
Fig. 10
Geographic visualization of regional embeddings. The 3D UMAP projection is encoded as RGB
Fig. 11
Loss and accuracy during training on the Masked Language Model task. The batch size is 128 tweets
Fig. 12
Comparison of the accuracy of the trained models across all regions on the MLM and emoticon prediction tasks
Fig. 13
Two-dimensional UMAP projections of regional vocabularies (left side) and word embeddings (right side) after removing CU, PR, GQ, and BR (compare with Figs. 5 and 9). Colors also capture similarity using a 3D UMAP projection of the same data. Nonetheless, the similarity between figures is undefined
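
The captions for Figs. 4, 5, and 10 mention an affinity matrix among regional vocabularies, cosine-based comparison, and 3D coordinates that are normalized and read as RGB. The following is a minimal sketch of those two ideas on toy data, not the authors' pipeline. It assumes each regional vocabulary can be represented as a token-frequency table; the function names (`cosine_similarity`, `affinity_matrix`, `to_rgb`), the toy tokens, and the per-axis min-max normalization are illustrative assumptions, and the UMAP step itself is left out.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-frequency tables."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def affinity_matrix(vocabularies):
    """Pairwise cosine similarities; rows and columns follow sorted region codes."""
    regions = sorted(vocabularies)
    matrix = [
        [cosine_similarity(vocabularies[r], vocabularies[c]) for c in regions]
        for r in regions
    ]
    return regions, matrix


def to_rgb(points):
    """Min-max normalize 3D points per axis and scale each axis to 0..255.

    Assumption: a simple per-axis normalization; the source only says the
    3D projection is "normalized and interpreted as RGB".
    """
    lows = [min(p[i] for p in points) for i in range(3)]
    highs = [max(p[i] for p in points) for i in range(3)]
    colors = []
    for p in points:
        rgb = []
        for i in range(3):
            span = highs[i] - lows[i]
            value = (p[i] - lows[i]) / span if span else 0.0
            rgb.append(round(255 * value))
        colors.append(tuple(rgb))
    return colors


# Hypothetical toy vocabularies keyed by two-letter ISO code (not real data).
toy = {
    "MX": Counter("chido wey carro chido".split()),
    "ES": Counter("guay tio coche guay".split()),
    "AR": Counter("copado che auto carro".split()),
}
regions, matrix = affinity_matrix(toy)
for region, row in zip(regions, matrix):
    print(region, [f"{value:.2f}" for value in row])
```

In practice, the resulting matrix (or one minus its entries, as a distance) could be handed to a projection method such as UMAP, and the 3D output passed through `to_rgb` to color each region.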
