Deception abilities emerged in large language models

Thilo Hagendorff

Proc Natl Acad Sci U S A. 2024 Jun 11;121(24):e2317967121. doi: 10.1073/pnas.2317967121. Epub 2024 Jun 4.

Abstract

Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are under suspicion of becoming able to deceive human operators and to utilize this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs but were nonexistent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified by utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can trigger misaligned deceptive behavior. GPT-4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time (P < 0.001). In complex second-order deception test scenarios, where the aim is to mislead someone who expects to be deceived, GPT-4 resorts to deceptive behavior 71.46% of the time (P < 0.001) when augmented with chain-of-thought reasoning. In sum, by revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.
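
The statistical claims above (e.g., P < 0.001 for a 99.16% deception rate) can be made concrete with a quick calculation. As a minimal sketch, assuming a one-sided binomial test against a 50% chance baseline; the exact test and trial counts are not stated in this abstract, so the numbers below are hypothetical placeholders:

    # Minimal sketch: significance of a reported deception rate.
    # Assumptions: a one-sided binomial test against a 50% chance baseline;
    # the trial count is a hypothetical placeholder, not taken from the paper.
    from scipy.stats import binomtest

    n_trials = 1200                         # hypothetical number of test scenarios
    n_deceptive = round(0.9916 * n_trials)  # scenarios with deceptive behavior

    result = binomtest(n_deceptive, n_trials, p=0.5, alternative="greater")
    print(f"deception rate: {n_deceptive / n_trials:.2%}")       # ~99.17%
    print(f"one-sided p-value vs. chance: {result.pvalue:.3g}")  # far below 0.001

Under these assumptions, any rate this far above chance yields a p-value well below 0.001 for even a few hundred trials, consistent with the significance levels reported.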

Keywords: AI alignment; deception; large language models.

Conflict of interest statement

Competing interests statement: The author declares no competing interest.

Figures

Fig. 1. Performance of different LLMs on first- and second-order false belief tasks.

Fig. 2. Schematic structure of first- and second-order deception tasks.

Fig. 3. Performance of different LLMs on first- and second-order deception tasks.

Fig. 4. Performance of ChatGPT and GPT-4 on second-order deception tasks with and without eliciting chain-of-thought reasoning. Error bars show 95% CIs.

Fig. 5. Performance of ChatGPT and GPT-4 on neutral recommendation and label tasks with and without inducing Machiavellianism. Error bars show 95% CIs.

Fig. 6. Pipeline of the development of deception abilities in AI systems. The paler parts indicate potential future states.
