Proc Natl Acad Sci U S A. 2023 Feb 7;120(6):e2218523120. doi: 10.1073/pnas.2218523120. Epub 2023 Feb 2.

Using cognitive psychology to understand GPT-3


Marcel Binz et al. Proc Natl Acad Sci U S A. 2023.

Abstract

We study GPT-3, a recent large language model, using tools from cognitive psychology. More specifically, we assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature. We find that much of GPT-3's behavior is impressive: It solves vignette-based tasks as well as or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multiarmed bandit task, and shows signatures of model-based reinforcement learning. Yet, we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task. Taken together, these results enrich our understanding of current large language models and pave the way for future investigations using tools from cognitive psychology to study increasingly capable and opaque artificial agents.
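The vignette-based evaluation described above amounts to presenting a canonical scenario as a plain-text prompt and recording the model's completion. A minimal sketch of that setup follows; the prompt wording is a paraphrase of the classic Linda problem, and `query_model` is a hypothetical stand-in for a language-model API call, not the authors' actual code:

```python
def query_model(prompt: str) -> str:
    # Hypothetical stand-in for a call to a large language model's
    # completion API; it returns a fixed answer here so the example
    # runs without network access.
    return "Option 1"

# The Linda problem: picking Option 2 is the classic conjunction fallacy,
# since a conjunction can never be more probable than one of its conjuncts.
prompt = (
    "Linda is 31 years old, single, outspoken, and very bright. She majored "
    "in philosophy. As a student, she was deeply concerned with issues of "
    "discrimination and social justice.\n"
    "Which option is more probable?\n"
    "Option 1: Linda is a bank teller.\n"
    "Option 2: Linda is a bank teller and is active in the feminist movement.\n"
    "Answer:"
)

answer = query_model(prompt)
committed_fallacy = "Option 2" in answer  # True would mirror the human bias
print(answer, committed_fallacy)
```

Scoring a whole battery then reduces to looping such prompts over the vignettes and comparing the model's answer distribution with human response data.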

Keywords: artificial intelligence; cognitive psychology; decision-making; language models; reasoning.


Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Vignette-based tasks. (A) Example prompt of a hypothetical scenario, in this case, the famous Linda problem, as submitted to GPT-3. (B) Results. While in 12 out of 12 standard vignettes GPT-3 either answers correctly or makes human-like mistakes, it makes mistakes that are not human-like when given the adversarial vignettes.
Fig. 2.
Decisions from descriptions. (A) Example prompt of a problem provided to GPT-3. (B) Example prompt of another problem provided to GPT-3. (C) Mean regret averaged over all 13,000 problems taken from Peterson et al. (23). Lower regret means better performance. Error bars indicate the SE of the mean. (D) Log-odds ratios of different contrasts used to test for cognitive biases. Positive values indicate that the given bias is present in humans (circle) or GPT-3 (triangle). Human data adapted from Ruggeri et al. (24).
Fig. 3.
Horizon task. (A) Visual overview of the horizon task paradigm. Each column pair corresponds to one example task. (B) Example prompt for one trial as submitted to GPT-3. (C) Mean regret for GPT-3 and human subjects by horizon condition. Lower regret means better performance. Error bars indicate the SE of the mean. Human data taken from Zaller et al. (29).
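The mean regret plotted in Fig. 3C is, per trial, the gap between the best arm's expected reward and the expected reward of the arm actually chosen, averaged over trials. A minimal sketch of that computation (the payoffs and choices below are illustrative, not data from the paper):

```python
def mean_regret(expected_rewards, choices):
    """Average regret over trials.

    expected_rewards: per-trial list of each arm's true mean payoff
    choices:          per-trial index of the arm that was chosen
    """
    regrets = [max(arms) - arms[pick]
               for arms, pick in zip(expected_rewards, choices)]
    return sum(regrets) / len(regrets)

# Illustrative two-armed bandit over three trials.
means = [[40.0, 60.0],
         [40.0, 60.0],
         [40.0, 60.0]]
picks = [1, 0, 1]  # the better arm is chosen on trials 1 and 3
print(mean_regret(means, picks))  # average of per-trial regrets 0, 20, 0
```

Zero regret means the agent always picked the arm with the highest expected payoff; larger values indicate worse choices.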
Fig. 4.
Two-step task. (A) Visual overview of the two-step task paradigm. (B) Example prompt of one trial in the canonical two-step task as submitted to GPT-3. (C) Model-free learning as a function of reward (rewarded vs. unrewarded) and transition type (common vs. rare). (D) Model-based learning as a function of reward and transition type. (E) Human behavior as a function of reward and transition type. Human data adapted from Daw et al. (30). (F) GPT-3's behavior as a function of reward and transition type. Error bars indicate the SE of the mean.
Fig. 5.
Causal reasoning. (A) Example prompt for the causal reasoning task adapted from Waldmann and Hagmayer (31). (B) GPT-3’s responses alongside responses of people and an ideal agent in the common-cause condition. (C) GPT-3’s responses alongside responses of people and an ideal agent in the causal-chain condition.
Fig. 6.
Prompt variations. (A) Performance for different prompt variations in the decisions-from-descriptions paradigm. (B) Performance for different prompt variations in the horizon task. (C) Effect of random exploration for different prompt variations in the horizon task. (D) Effect of directed exploration for different prompt variations in the horizon task. (E) GPT-3's behavior as a function of reward (rewarded vs. unrewarded) and transition type (common vs. rare) for the alien cover story (reproduced from Fig. 4F). (F) GPT-3's behavior as a function of reward and transition type for the magical carpet cover story. Error bars indicate the SE of the mean.

Comment in

  • Shiffrin R., Mitchell M., Probing the psychology of AI models. Proc Natl Acad Sci U S A. 2023 Mar 7;120(10):e2300963120. doi: 10.1073/pnas.2300963120. Epub 2023 Mar 1. PMID: 36857344.

References

    1. Gunning D., et al., XAI–explainable artificial intelligence. Sci. Rob. 4, eaay7120 (2019).
    2. Brown T., et al., Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
    3. Chen M., et al., Evaluating large language models trained on code. arXiv [Preprint] (2021). http://arxiv.org/abs/2107.03374 (Accessed 20 January 2023).
    4. Lin Z., et al., Caire: An end-to-end empathetic chatbot. Proceedings of the AAAI Conference on Artificial Intelligence 34, 13622–13623 (2020).
    5. Noever D., Ciolino M., Kalin J., The chess transformer: Mastering play using generative language models. arXiv [Preprint] (2020). http://arxiv.org/abs/2008.04057 (Accessed 20 January 2023).
