Evaluating large language models in theory of mind tasks
- PMID: 39471222
- PMCID: PMC11551352
- DOI: 10.1073/pnas.2405460121
Abstract
Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.
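The all-or-nothing scoring rule described in the abstract can be made concrete with a minimal sketch. The names below (Scenario, Task, the model callable, and the exact-match answer check) are illustrative assumptions, not the paper's actual evaluation code or prompts:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scenario:
    prompt: str                 # scenario text presented to the model
    expected_completion: str    # the answer counted as correct

@dataclass
class Task:
    # One false-belief scenario, three closely matched true-belief
    # controls, plus the reversed version of each: eight scenarios total.
    scenarios: List[Scenario]

def solves_task(model: Callable[[str], str], task: Task) -> bool:
    """A task counts as solved only if the model answers all eight
    scenario variants correctly (hypothetical exact-match check)."""
    return all(
        model(s.prompt).strip() == s.expected_completion
        for s in task.scenarios
    )

def score(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of the 40 tasks solved under the all-or-nothing rule."""
    return sum(solves_task(model, t) for t in tasks) / len(tasks)
```

Under this rule, guessing is heavily penalized: a model that answers each scenario correctly 90% of the time by chance would solve a full eight-scenario task only about 43% of the time, which is why the matched controls and reversals make the benchmark stringent.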
Keywords: AI; false-belief tasks; large language models; psychology of AI; theory of mind.
Conflict of interest statement
Competing interests statement: The author declares no competing interest.
