Sci Rep. 2025 Jul 1;15(1):21428.
doi: 10.1038/s41598-025-01715-7.

A publicly available benchmark for assessing large language models' ability to predict how humans balance self-interest and the interest of others


Valerio Capraro et al. Sci Rep.

Abstract

Large language models (LLMs) hold enormous potential to assist humans in decision-making, from everyday to high-stakes scenarios. However, because many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they can capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. This benchmark consists of 106 textual instructions from dictator-game experiments conducted with human participants from 12 countries, alongside a compendium of actual human behavior in each experiment. We evaluate four advanced chatbots against this benchmark and find that none of them meets it. In particular, only GPT-4 and GPT-4o (not Bard or Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions and reveals an "optimistic bias" in current versions of GPT.
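The core comparison the benchmark performs can be sketched as follows. This is a minimal, hypothetical illustration of measuring an "optimistic bias" as the mean signed error between model-predicted and observed average giving; the function name and all numbers below are illustrative placeholders, not data or code from the paper.

```python
# Hypothetical sketch: comparing an LLM's predicted average giving in the
# dictator game against observed human averages. All values are invented
# placeholders for illustration only.

def optimism_bias(predicted, actual):
    """Mean signed error across experiments: a positive value means the
    model predicts more generosity than humans actually show."""
    assert len(predicted) == len(actual)
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)

# Illustrative per-experiment average giving (fraction of the pie).
actual_giving = [0.20, 0.30, 0.25, 0.40]      # hypothetical human data
predicted_giving = [0.35, 0.45, 0.40, 0.50]   # hypothetical LLM predictions

bias = optimism_bias(predicted_giving, actual_giving)
print(f"Mean signed error: {bias:+.3f}")  # positive => optimistic bias
```

In a plot like Fig. 1, a systematically positive signed error corresponds to most points lying above the 45° line.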

Keywords: Altruism; Dictator game; Economic games; Generative artificial intelligence; Human behavior.


Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Predicted vs actual average giving in the standard dictator game. Each dot represents an experiment in which human participants played the standard dictator game. The horizontal axis reports the actual average giving; the vertical axis reports the average giving predicted by GPT-4. The red line corresponds to the 45° line.
Fig. 2
Predicted vs actual distribution of giving in the standard dictator game. Red diamonds represent the distribution of giving in the standard dictator games. Blue bars represent the distribution of giving predicted by GPT-4 in the same games. Error bars represent standard errors of the mean.
Fig. 3
Predicted vs actual frequency of monetary altruism in the extreme dictator games. Red diamonds represent the frequency of monetary altruistic choices in each of the six conditions of the extreme dictator game reported in. Blue bars represent the frequency of altruism in the six conditions predicted by GPT-4. Error bars represent standard errors of the mean.
Fig. 4
Predicted vs actual distribution of giving in the dictator game with a “take” option. Red diamonds represent the distribution of giving in the dictator games with a “take” option reported in. Blue bars represent the distribution of giving in the dictator games with a “take” option predicted by GPT-4. Error bars represent standard errors of the mean.
