Review

The sociolinguistic foundations of language modeling

Jack Grieve et al. Front Artif Intell. 2025 Jan 13;7:1472411. doi: 10.3389/frai.2024.1472411. eCollection 2024.
Abstract

In this article, we introduce a sociolinguistic perspective on language modeling. We claim that language models in general are inherently modeling varieties of language, and we consider how this insight can inform the development and deployment of language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective could help us better understand five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. We argue that to maximize the performance and societal value of language models it is important to carefully compile training corpora that accurately represent the specific varieties of language being modeled, drawing on theories, methods, and descriptions from the field of sociolinguistics.

Keywords: AI ethics; artificial intelligence; computational sociolinguistics; corpus linguistics; large language models; natural language processing; varieties of language.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The author(s) declared that they were an editorial board member of Frontiers at the time of submission; this had no impact on the peer review process or the final decision.

Figures

Figure 1
Varieties of language. This figure illustrates the concept of a variety of language, showing how the interaction between three distinct extra-linguistic factors—the social background of the people who produce language (dialect), the social context in which language is produced (register), and the range of time over which language is produced (period)—can be used to specify a variety of language. It also illustrates how varieties of language are hierarchically organized, composed of smaller and smaller sub-varieties.
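To make the three-factor definition in the caption concrete, the following is a minimal Python sketch of a variety of language as a data structure. The field names (dialect, register, period) follow the caption; the class itself and the example values are illustrative assumptions, not part of the article.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Variety:
    """A variety of language specified by three extra-linguistic factors."""
    dialect: str                 # social background of the people producing the language
    register: str                # social context in which the language is produced
    period: str                  # range of time over which the language is produced
    sub_varieties: List["Variety"] = field(default_factory=list)  # nested sub-varieties

# Illustrative example: a broad variety with one narrower sub-variety nested inside it.
american_english = Variety("American English", "web text", "2020s")
american_english.sub_varieties.append(Variety("Southern US English", "web text", "2020s"))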
Figure 2
Representative corpus design. This figure presents a corpus as a representative sample of texts taken from a given variety of language (i.e., from a larger population of texts delimited by relevant extra-linguistic factors). It also illustrates how compiling a corpus that accurately represents a target variety requires access to an underlying model of that variety of language, including its internal sub-varieties, so that the corpus can be stratified to capture internal variation in that variety. Naive corpus compilation strategies that rely on convenience sampling will generally lead to less representative samples.
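As a rough illustration of the stratified sampling described in the caption, the sketch below draws texts from each sub-variety in proportion to its estimated share of the target variety. The strata names, shares, and candidate-text pools are hypothetical placeholders rather than anything specified in the article.

import random
from typing import Dict, List

def stratified_corpus_sample(candidate_texts: Dict[str, List[str]],
                             population_share: Dict[str, float],
                             corpus_size: int) -> List[str]:
    """Draw texts from each sub-variety in proportion to its share of the target variety."""
    corpus: List[str] = []
    for sub_variety, share in population_share.items():
        quota = round(corpus_size * share)                  # texts owed to this stratum
        pool = candidate_texts.get(sub_variety, [])
        corpus.extend(random.sample(pool, min(quota, len(pool))))
    return corpus

# Hypothetical strata and shares; a convenience sample would ignore these proportions.
texts = {"formal prose": [f"f{i}" for i in range(500)],
         "conversation": [f"c{i}" for i in range(500)],
         "social media": [f"s{i}" for i in range(500)]}
shares = {"formal prose": 0.2, "conversation": 0.5, "social media": 0.3}
corpus = stratified_corpus_sample(texts, shares, corpus_size=100)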
Figure 3
Sociolinguistic bias in language models. This figure illustrates how training language models on corpora that accurately represent the target variety of language, including its internal structure and especially its constituent dialects, can potentially help address social bias, including both quality-of-service harms and stereotyping. This is exemplified by comparing two hypothetical models of American English, trained on corpora that inaccurately and accurately represent regional dialect variation (based on Grieve, 2016) in this variety of language.
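One way to probe the quality-of-service harms mentioned in the caption is to compare a model's perplexity on held-out text from each dialect. The sketch below assumes a Hugging Face causal language model and hypothetical per-dialect samples; neither the checkpoint nor the data comes from the article.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; any causal LM would work here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of one text under the model (higher values = poorer fit)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Hypothetical held-out samples, one per dialect; a large gap between dialects
# would point to quality-of-service differences rooted in corpus composition.
dialect_samples = {"dialect_a": "held-out text from dialect A ...",
                   "dialect_b": "held-out text from dialect B ..."}
for dialect, text in dialect_samples.items():
    print(dialect, round(perplexity(text), 1))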
Figure 4
Sociolinguistic adaptation of language models. This figure illustrates how an understanding of the sociolinguistic structure of varieties of language can inform the adaptation of language models. Language model adaptation can be seen as the process of fine-tuning a base model, potentially in an iterative manner, to predict word tokens in a more narrowly defined variety of language that is subsumed by the larger variety of language represented by the base model.
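The adaptation process described in the caption could look roughly like the fine-tuning loop below, sketched with the Hugging Face Trainer. The base model name, the sub-variety texts, and all hyperparameters are illustrative assumptions rather than the authors' procedure.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"  # assumed base model trained on the broader variety
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical texts drawn from the narrower target sub-variety.
sub_variety_texts = ["example text from the target sub-variety ..."] * 8
dataset = Dataset.from_dict({"text": sub_variety_texts})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # repeat on successively narrower sub-varieties for iterative adaptation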

