Review

The sociolinguistic foundations of language modeling

Jack Grieve et al. Front Artif Intell. 2025 Jan 13;7:1472411. doi: 10.3389/frai.2024.1472411. eCollection 2024.
Abstract

In this article, we introduce a sociolinguistic perspective on language modeling. We claim that language models in general are inherently modeling varieties of language, and we consider how this insight can inform the development and deployment of language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective could help us better understand five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. We argue that to maximize the performance and societal value of language models it is important to carefully compile training corpora that accurately represent the specific varieties of language being modeled, drawing on theories, methods, and descriptions from the field of sociolinguistics.

Keywords: AI ethics; artificial intelligence; computational sociolinguistics; corpus linguistics; large language models; natural language processing; varieties of language.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The author(s) declared that they were an editorial board member of Frontiers at the time of submission; this had no impact on the peer review process or the final decision.

Figures

Figure 1
Varieties of language. This figure illustrates the concept of a variety of language, showing how the interaction between three distinct extra-linguistic factors—the social background of the people who produce language (dialect), the social context in which language is produced (register), and the range of time over which language is produced (period)—can be used to specify a variety of language. It also illustrates how varieties of language are hierarchically organized, composed of smaller and smaller sub-varieties.
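To make the three-factor definition in the caption concrete, the following is a minimal Python sketch of a variety of language as a data structure. The field names (dialect, register, period) follow the caption; the class itself and the example values are illustrative assumptions, not part of the article.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Variety:
    """A variety of language specified by three extra-linguistic factors."""
    dialect: str                 # social background of the people producing the language
    register: str                # social context in which the language is produced
    period: str                  # range of time over which the language is produced
    sub_varieties: List["Variety"] = field(default_factory=list)  # nested sub-varieties

# Illustrative example: a broad variety with one narrower sub-variety nested inside it.
american_english = Variety("American English", "web text", "2020s")
american_english.sub_varieties.append(Variety("Southern US English", "web text", "2020s"))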
Figure 2
Representative corpus design. This figure presents a corpus as a representative sample of texts taken from a given variety of language (i.e., from a larger population of texts delimited by relevant extra-linguistic factors). It also illustrates how compiling a corpus that accurately represents a target variety requires access to an underlying model of that variety of language, including its internal sub-varieties, so that the corpus can be stratified to capture internal variation in that variety. Naive corpus compilation strategies that rely on convenience sampling will generally lead to less representative samples.
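As a rough illustration of the stratified sampling described in the caption, the sketch below draws texts from each sub-variety in proportion to its estimated share of the target variety. The strata names, shares, and candidate-text pools are hypothetical placeholders rather than anything specified in the article.

import random
from typing import Dict, List

def stratified_corpus_sample(candidate_texts: Dict[str, List[str]],
                             population_share: Dict[str, float],
                             corpus_size: int) -> List[str]:
    """Draw texts from each sub-variety in proportion to its share of the target variety."""
    corpus: List[str] = []
    for sub_variety, share in population_share.items():
        quota = round(corpus_size * share)                  # texts owed to this stratum
        pool = candidate_texts.get(sub_variety, [])
        corpus.extend(random.sample(pool, min(quota, len(pool))))
    return corpus

# Hypothetical strata and shares; a convenience sample would ignore these proportions.
texts = {"formal prose": [f"f{i}" for i in range(500)],
         "conversation": [f"c{i}" for i in range(500)],
         "social media": [f"s{i}" for i in range(500)]}
shares = {"formal prose": 0.2, "conversation": 0.5, "social media": 0.3}
corpus = stratified_corpus_sample(texts, shares, corpus_size=100)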
Figure 3
Sociolinguistic bias in language models. This figure illustrates how training language models on corpora that accurately represent the target variety of language, including its internal structure and especially its constituent dialects, can potentially help address social bias, including both quality-of-service harms and stereotyping. This is exemplified by comparing two hypothetical models of American English, trained on corpora that inaccurately and accurately represent regional dialect variation (based on Grieve, 2016) in this variety of language.
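One way to probe the quality-of-service harms mentioned in the caption is to compare a model's perplexity on held-out text from each dialect. The sketch below assumes a Hugging Face causal language model and hypothetical per-dialect samples; neither the checkpoint nor the data comes from the article.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; any causal LM would work here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of one text under the model (higher values = poorer fit)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Hypothetical held-out samples, one per dialect; a large gap between dialects
# would point to quality-of-service differences rooted in corpus composition.
dialect_samples = {"dialect_a": "held-out text from dialect A ...",
                   "dialect_b": "held-out text from dialect B ..."}
for dialect, text in dialect_samples.items():
    print(dialect, round(perplexity(text), 1))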
Figure 4
Sociolinguistic adaptation of language models. This figure illustrates how an understanding of the sociolinguistic structure of varieties of language can inform the adaptation of language models. Language model adaptation can be seen as the process of fine-tuning a base model, potentially in an iterative manner, to predict word tokens in a more narrowly defined variety of language that is subsumed by the larger variety of language represented by the base model.
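The adaptation process described in the caption could look roughly like the fine-tuning loop below, sketched with the Hugging Face Trainer. The base model name, the sub-variety texts, and all hyperparameters are illustrative assumptions rather than the authors' procedure.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"  # assumed base model trained on the broader variety
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical texts drawn from the narrower target sub-variety.
sub_variety_texts = ["example text from the target sub-variety ..."] * 8
dataset = Dataset.from_dict({"text": sub_variety_texts})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # repeat on successively narrower sub-varieties for iterative adaptation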

