Proc Natl Acad Sci U S A. 2024 Nov 5;121(45):e2405460121. doi: 10.1073/pnas.2405460121. Epub 2024 Oct 29.

Evaluating large language models in theory of mind tasks

Michal Kosinski

Abstract

Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.

Keywords: AI; false-belief tasks; large language models; psychology of AI; theory of mind.

Conflict of interest statement

Competing interests statement: The author declares no competing interest.

Figures

Fig. 1. Changes in the probabilities of ChatGPT-4's completions of Prompts 1.1 and 1.2 as the story was revealed in one-sentence increments.

Fig. 2. Changes in the probabilities of ChatGPT-4's completions of Prompts 2.1 and 2.2 as the story was revealed to it in one-sentence increments. The last sentence of the story ("John comes back home and wants to play with the cat.") was added to Prompt 2.2, as this prompt made little sense on its own throughout most of the story.

Fig. 3. The percentage of false-belief tasks solved by LLMs (out of 40). Each task contained a false-belief scenario, three accompanying true-belief scenarios, and the reversed versions of all four scenarios. A model had to solve 16 prompts across all eight scenarios to score a single point. The number of parameters and models' publication dates are in parentheses. The number of parameters for models in the GPT-3 family was estimated by Gao (55) and for ChatGPT-4 by Patel and Wong (56). Average children's performance on false-belief tasks was reported after a meta-analysis of 178 studies (54). Error bars represent 95% CI.
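For readers who want to see the aggregation behind Fig. 3 concretely, the scoring rule described in the abstract and the caption is all-or-nothing: a model earns a point for a task only if it answers all 16 prompts correctly, two per scenario across the eight scenarios. The Python sketch below illustrates that rule; the data layout and function names are illustrative assumptions, not the paper's published code.

    from typing import Dict, List

    # One task = eight scenarios (one false-belief, three true-belief controls,
    # and the reversed versions of all four); each scenario has two prompts.
    # The booleans record whether the model completed each prompt correctly.
    # (This layout is an assumption made for illustration.)
    Task = Dict[str, List[bool]]

    def task_solved(task: Task) -> bool:
        # A task counts as solved only if all 16 prompts are answered correctly.
        if len(task) != 8:
            raise ValueError("Expected exactly eight scenarios per task.")
        return all(len(prompts) == 2 and all(prompts) for prompts in task.values())

    def percent_solved(tasks: List[Task]) -> float:
        # Percentage of tasks solved (out of 40 in the study), as plotted in Fig. 3.
        return 100.0 * sum(task_solved(t) for t in tasks) / len(tasks)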

References

    Albuquerque N., et al., Dogs recognize dog and human emotions. Biol. Lett. 12, 20150883 (2016).
    Heyes C. M., Frith C. D., The cultural evolution of mind reading. Science 344, 1243091 (2014).
    Zhang J., Hedden T., Chia A., Perspective-taking and depth of theory-of-mind reasoning in sequential-move games. Cogn. Sci. 36, 560–573 (2012).
    Milligan K., Astington J. W., Dack L. A., Language and theory of mind: Meta-analysis of the relation between language ability and false-belief understanding. Child Dev. 78, 622–646 (2007).
    Seyfarth R. M., Cheney D. L., Affiliation, empathy, and the origins of Theory of Mind. Proc. Natl. Acad. Sci. U.S.A. 110, 10349–10356 (2013).
