Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments
- PMID: 40694202
- DOI: 10.3758/s13421-025-01755-4
Abstract
The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain ones, they often accompany their responses with metacognitive confidence judgments indicating how strongly they believe their answers are accurate. LLMs are certainly capable of providing confidence judgments, but it is currently unclear how accurate these judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in domains of aleatory uncertainty, namely NFL predictions (Study 1; n = 502) and Oscar predictions (Study 2; n = 109), and in domains of epistemic uncertainty, namely Pictionary performance (Study 3; n = 164), trivia questions (Study 4; n = 110), and questions about life at a university (Study 5; n = 110). We find several commonalities between LLMs and humans, such as similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). We also find that LLMs, like humans, tend to be overconfident. However, unlike humans, LLMs (especially ChatGPT and Gemini) often fail to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.
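The distinction between absolute and relative accuracy can be made concrete with a small sketch. The abstract does not state which statistics the studies use, so the two measures below are assumptions chosen because they are common in metacognition research: overconfidence as mean confidence minus proportion correct (absolute accuracy), and the Goodman-Kruskal gamma correlation between confidence and correctness (relative accuracy). The function names and the example data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def overconfidence(confidence, correct):
    """Absolute accuracy (assumed measure): mean confidence minus proportion correct.

    Positive values indicate overconfidence, negative values underconfidence.
    `confidence` holds probabilities in [0, 1]; `correct` holds 0/1 outcomes.
    """
    if len(confidence) != len(correct) or not confidence:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_confidence = sum(confidence) / len(confidence)
    proportion_correct = sum(correct) / len(correct)
    return mean_confidence - proportion_correct


def goodman_kruskal_gamma(confidence, correct):
    """Relative accuracy (assumed measure): Goodman-Kruskal gamma.

    Counts item pairs where the higher-confidence item was also the correct
    one (concordant) versus the incorrect one (discordant). Tied pairs are
    ignored. Returns NaN when no pair is informative.
    """
    concordant = discordant = 0
    for (c1, k1), (c2, k2) in combinations(zip(confidence, correct), 2):
        product = (c1 - c2) * (k1 - k2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    informative = concordant + discordant
    if informative == 0:
        return float("nan")
    return (concordant - discordant) / informative


if __name__ == "__main__":
    # Hypothetical judgments for five questions (not data from the studies).
    confidence = [0.9, 0.8, 0.6, 0.7, 0.5]
    correct = [1, 0, 1, 1, 0]
    print("overconfidence:", round(overconfidence(confidence, correct), 3))
    print("gamma:", round(goodman_kruskal_gamma(confidence, correct), 3))
```

Under these assumed definitions, a judge can be well calibrated on average (overconfidence near zero) yet still fail to assign higher confidence to the answers that turn out to be correct (gamma near zero), which is why the two dimensions are reported separately.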
Keywords: Artificial intelligence; Confidence judgments; Large Language Models; Metacognition; Metacognitive accuracy.
© 2025. The Author(s).
Conflict of interest statement
Declarations.
Ethics Approval: Approval was obtained from the Institutional Review Board at Carnegie Mellon University (Study #2017_00000367). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.
Consent to participate: Informed consent was obtained from all individual participants included in the study.
Consent for publication: Patients consented to the publication of their anonymized data.
Open practices statement: The data and materials for all studies are available (https://osf.io/b6qhx/?view_only=219cc3ad034542f6bd4271457f87ef1f). Study 2 (https://aspredicted.org/T1L_2J4), Study 3 (https://aspredicted.org/BJP_BZX), Study 4 (https://aspredicted.org/f33r-z39k.pdf), and Study 5 (https://aspredicted.org/yd7f-rjd5.pdf) were preregistered on AsPredicted.
Conflicts of interests/Competing interests: The authors have no relevant financial or non-financial interests to disclose.
Author note: Studies 2, 3, 4, and 5 were preregistered (see links in manuscript). Data, supplemental materials, and analysis code for this manuscript are available online (https://osf.io/b6qhx/?view_only=219cc3ad034542f6bd4271457f87ef1f). This study was funded in part by a Doctoral Dissertation Research Improvement Grant from the National Science Foundation (Award #2333553) and a Small Undergraduate Research Grant from Carnegie Mellon University. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of our grantors. We declare no conflicts of interest. We acknowledge the support of the research assistants, friends, and family members who made this research possible.