Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT

Thilo Hagendorff et al. Nat Comput Sci. 2023 Oct;3(10):833-838. doi: 10.1038/s43588-023-00527-x. Epub 2023 Oct 5.

Abstract

We design a battery of semantic illusions and cognitive reflection tests, aimed to elicit intuitive yet erroneous responses. We administer these tasks, traditionally used to study reasoning and decision-making in humans, to OpenAI's generative pre-trained transformer model family. The results show that as the models expand in size and linguistic proficiency they increasingly display human-like intuitive system 1 thinking and associated cognitive errors. This pattern shifts notably with the introduction of ChatGPT models, which tend to respond correctly, avoiding the traps embedded in the tasks. Both ChatGPT-3.5 and 4 utilize the input-output context window to engage in chain-of-thought reasoning, reminiscent of how people use notepads to support their system 2 thinking. Yet, they remain accurate even when prevented from engaging in chain-of-thought reasoning, indicating that their system-1-like next-word generation processes are more accurate than those of older models. Our findings highlight the value of applying psychological methodologies to study large language models, as this can uncover previously undetected emergent characteristics.
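
As a concrete illustration of the experimental protocol summarized above, the following Python sketch administers one classic CRT item (the bat-and-ball problem, whose correct answer is 5 cents and whose intuitive lure is 10 cents) to an OpenAI chat model and sorts the reply into the correct, intuitive or atypical categories used in the study. The prompt wording, the model name and the keyword-based categorizer are illustrative assumptions, not the authors' actual materials.

    # Minimal sketch (not the authors' code): give one CRT item to an OpenAI
    # chat model and bucket the reply into the categories used in the paper.
    # The prompt, the model name and the keyword categorizer are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Classic CRT item: correct answer 5 cents, intuitive lure 10 cents.
    CRT_PROMPT = (
        "A bat and a ball cost $1.10 in total. "
        "The bat costs $1.00 more than the ball. "
        "How much does the ball cost?"
    )

    def categorize(answer: str) -> str:
        """Very rough keyword scoring for this single item; a real scorer
        would parse the numeric answer instead of matching substrings."""
        text = answer.lower()
        if "5 cents" in text or "$0.05" in text or "five cents" in text:
            return "correct"
        if "10 cents" in text or "$0.10" in text or "ten cents" in text:
            return "intuitive"
        return "atypical"  # all other incorrect or unparsable responses

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any available chat model works
        messages=[{"role": "user", "content": CRT_PROMPT}],
        temperature=0,
        max_tokens=100,
    )
    answer = response.choices[0].message.content
    print(answer)
    print("category:", categorize(answer))

Setting the temperature to 0 makes the sampling effectively greedy, which helps when scoring responses reproducibly; the study itself scored a battery of 150 CRT tasks and 50 semantic illusions rather than a single prompt.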


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Human and LLM performance on the CRT tasks.
a, Example responses to one of the CRT tasks, categorized as correct, intuitive (but incorrect) and atypical (that is, all other incorrect responses). Within each category, responses preceded by written chain-of-thought reasoning were additionally labeled as ‘chain-of-thought responses’. b, Human and LLM performance on 150 CRT tasks. c, LLMs’ responses when instructed to engage in, or prevented from engaging in, chain-of-thought reasoning. The source data file includes 95% confidence intervals.
Fig. 2
Fig. 2. Human and LLM performance on semantic illusions.
a, Example responses to one of the semantic illusions, categorized as correct, intuitive and atypical. b, Human and LLM performance on 50 semantic illusions. c, GPT-3-davinci-003’s responses when instructed to examine the task’s assumptions. The source data file includes 95% confidence intervals.
Extended Data Fig. 1
Extended Data Fig. 1. Learning curves.
Change in the fraction of GPT-3-davinci-003’s correct responses against the number of training examples that the task was prefixed with. Error bars represent 95% confidence intervals.
