Mem Cognit. 2025 Jul 22. doi: 10.3758/s13421-025-01755-4. Online ahead of print.

Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments

Trent N Cash et al.

Abstract

The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain ones, they often accompany their responses with metacognitive confidence judgments indicating how strongly they believe their answers are accurate. LLMs are certainly capable of providing confidence judgments, but it is currently unclear how accurate those judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in domains of aleatory uncertainty (NFL predictions, Study 1, n = 502; Oscar predictions, Study 2, n = 109) and domains of epistemic uncertainty (Pictionary performance, Study 3, n = 164; trivia questions, Study 4, n = 110; questions about life at a university, Study 5, n = 110). We find several commonalities between LLMs and humans, such as similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). We also find that, like humans, LLMs tend to be overconfident. Unlike humans, however, LLMs (especially ChatGPT and Gemini) often fail to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.

Keywords: Artificial intelligence; Confidence judgments; Large Language Models; Metacognition; Metacognitive accuracy.
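The abstract distinguishes absolute accuracy (calibration, including overconfidence) from relative accuracy (whether higher confidence goes with a higher chance of being correct). As a purely illustrative sketch, not the authors' analysis code, the Python below shows one common way such measures can be computed; the 0-100 confidence scale, the toy data, and the use of a rank correlation as the relative-accuracy measure are assumptions made here for demonstration only.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical data: confidence judgments (0-100%) and item-level accuracy (1 = correct).
confidence = np.array([90, 70, 60, 85, 50, 95])
correct = np.array([1, 0, 1, 1, 0, 0])

# Absolute accuracy: gap between mean confidence and actual proportion correct.
# A positive value indicates overconfidence.
overconfidence = confidence.mean() / 100 - correct.mean()

# Relative accuracy: does higher confidence track a higher chance of being correct?
# Captured here with a rank correlation between confidence and correctness.
resolution, _ = spearmanr(confidence, correct)

print(f"Overconfidence: {overconfidence:+.2f}")
print(f"Confidence-accuracy rank correlation: {resolution:.2f}")

On this toy data, mean confidence (0.75) exceeds the proportion correct (0.50), so the sketch reports overconfidence of +0.25; the rank correlation summarizes how well the confidence ratings discriminate correct from incorrect answers.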

Conflict of interest statement

Declarations

Ethics approval: Approval was obtained from the Institutional Review Board at Carnegie Mellon University (Study #2017_00000367). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.

Consent to participate: Informed consent was obtained from all individual participants included in the study.

Consent for publication: Participants consented to the publication of their anonymized data.

Open practices statement: The data and materials for all studies are available at https://osf.io/b6qhx/?view_only=219cc3ad034542f6bd4271457f87ef1f. Study 2 ( https://aspredicted.org/T1L_2J4 ), Study 3 ( https://aspredicted.org/BJP_BZX ), Study 4 ( https://aspredicted.org/f33r-z39k.pdf ), and Study 5 ( https://aspredicted.org/yd7f-rjd5.pdf ) were preregistered on AsPredicted.

Conflicts of interest/Competing interests: The authors have no relevant financial or non-financial interests to disclose.

Author note: Studies 2, 3, 4, and 5 were preregistered (see links above). Data, supplemental materials, and analysis code for this manuscript are available online ( https://osf.io/b6qhx/?view_only=219cc3ad034542f6bd4271457f87ef1f ). This study was funded in part by a Doctoral Dissertation Research Improvement Grant from the National Science Foundation (Award #2333553) and a Small Undergraduate Research Grant from Carnegie Mellon University. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of our grantors. We declare no conflicts of interest. We acknowledge the support of the research assistants, friends, and family members who made this research possible.
