Sci Rep. 2025 Jul 1;15(1):21428.
doi: 10.1038/s41598-025-01715-7.

A publicly available benchmark for assessing large language models' ability to predict how humans balance self-interest and the interest of others


Valerio Capraro et al. Sci Rep.

Abstract

Large language models (LLMs) hold enormous potential to assist humans in decision-making, from everyday to high-stakes scenarios. However, because many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they can capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. This benchmark consists of 106 textual instructions from dictator-game experiments conducted with human participants from 12 countries, alongside a compendium of actual human behavior in each experiment. We evaluate four advanced chatbots against this benchmark and find that none of them meets it. In particular, only GPT-4 and GPT-4o (not Bard or Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions and reveals an "optimistic bias" in current versions of GPT.
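The core comparison the benchmark performs can be sketched as follows. This is a minimal, hypothetical illustration of measuring an "optimistic bias" as the mean signed error between model-predicted and observed average giving; the function name and all numbers below are illustrative placeholders, not data or code from the paper.

```python
# Hypothetical sketch: comparing an LLM's predicted average giving in the
# dictator game against observed human averages. All values are invented
# placeholders for illustration only.

def optimism_bias(predicted, actual):
    """Mean signed error across experiments: a positive value means the
    model predicts more generosity than humans actually show."""
    assert len(predicted) == len(actual)
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)

# Illustrative per-experiment average giving (fraction of the pie).
actual_giving = [0.20, 0.30, 0.25, 0.40]      # hypothetical human data
predicted_giving = [0.35, 0.45, 0.40, 0.50]   # hypothetical LLM predictions

bias = optimism_bias(predicted_giving, actual_giving)
print(f"Mean signed error: {bias:+.3f}")  # positive => optimistic bias
```

In a plot like Fig. 1, a systematically positive signed error corresponds to most points lying above the 45° line.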

Keywords: Altruism; Dictator game; Economic games; Generative artificial intelligence; Human behavior.


Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Predicted vs actual average giving in the standard dictator game. Each dot represents an experiment in which human participants played the standard dictator game. The horizontal axis reports the actual average giving; the vertical axis reports the average giving predicted by GPT-4. The red line corresponds to the 45° line.
Fig. 2
Predicted vs actual distribution of giving in the standard dictator game. Red diamonds represent the distribution of giving in the standard dictator games. Blue bars represent the distribution of giving predicted by GPT-4 in the same games. Error bars represent standard errors of the mean.
Fig. 3
Predicted vs actual frequency of monetary altruism in the extreme dictator games. Red diamonds represent the frequency of monetary altruistic choices in each of the six conditions of the extreme dictator game reported in. Blue bars represent the frequency of altruism in the six conditions predicted by GPT-4. Error bars represent standard errors of the mean.
Fig. 4
Predicted vs actual distribution of giving in the dictator game with a “take” option. Red diamonds represent the distribution of giving in the dictator games with a “take” option reported in. Blue bars represent the distribution of giving in the dictator games with a “take” option predicted by GPT-4. Error bars represent standard errors of the mean.
