Mem Cognit. 2025 Jul 22. doi: 10.3758/s13421-025-01755-4. Online ahead of print.

Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments

Trent N Cash et al.

Abstract

The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain ones, they often accompany their responses with metacognitive confidence judgments indicating how strongly they believe their answers are accurate. LLMs are certainly capable of providing confidence judgments, but it is currently unclear how accurate those judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in domains of aleatory uncertainty (NFL predictions, Study 1, n = 502; Oscar predictions, Study 2, n = 109) and domains of epistemic uncertainty (Pictionary performance, Study 3, n = 164; trivia questions, Study 4, n = 110; questions about life at a university, Study 5, n = 110). We find several commonalities between LLMs and humans, such as similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). We also find that, like humans, LLMs tend to be overconfident. Unlike humans, however, LLMs (especially ChatGPT and Gemini) often fail to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.

Keywords: Artificial intelligence; Confidence judgments; Large Language Models; Metacognition; Metacognitive accuracy.
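The abstract distinguishes absolute accuracy (calibration, including overconfidence) from relative accuracy (whether higher confidence goes with a higher chance of being correct). As a purely illustrative sketch, not the authors' analysis code, the Python below shows one common way such measures can be computed; the 0-100 confidence scale, the toy data, and the use of a rank correlation as the relative-accuracy measure are assumptions made here for demonstration only.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical data: confidence judgments (0-100%) and item-level accuracy (1 = correct).
confidence = np.array([90, 70, 60, 85, 50, 95])
correct = np.array([1, 0, 1, 1, 0, 0])

# Absolute accuracy: gap between mean confidence and actual proportion correct.
# A positive value indicates overconfidence.
overconfidence = confidence.mean() / 100 - correct.mean()

# Relative accuracy: does higher confidence track a higher chance of being correct?
# Captured here with a rank correlation between confidence and correctness.
resolution, _ = spearmanr(confidence, correct)

print(f"Overconfidence: {overconfidence:+.2f}")
print(f"Confidence-accuracy rank correlation: {resolution:.2f}")

On this toy data, mean confidence (0.75) exceeds the proportion correct (0.50), so the sketch reports overconfidence of +0.25; the rank correlation summarizes how well the confidence ratings discriminate correct from incorrect answers.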

Conflict of interest statement

Declarations

Ethics approval: Approval was obtained from the Institutional Review Board at Carnegie Mellon University (Study #2017_00000367). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.

Consent to participate: Informed consent was obtained from all individual participants included in the study.

Consent for publication: Participants consented to the publication of their anonymized data.

Open practices statement: The data and materials for all studies are available at https://osf.io/b6qhx/?view_only=219cc3ad034542f6bd4271457f87ef1f. Study 2 ( https://aspredicted.org/T1L_2J4 ), Study 3 ( https://aspredicted.org/BJP_BZX ), Study 4 ( https://aspredicted.org/f33r-z39k.pdf ), and Study 5 ( https://aspredicted.org/yd7f-rjd5.pdf ) were preregistered on AsPredicted.

Conflicts of interest/Competing interests: The authors have no relevant financial or non-financial interests to disclose.

Author note: Studies 2, 3, 4, and 5 were preregistered (see links above). Data, supplemental materials, and analysis code for this manuscript are available online ( https://osf.io/b6qhx/?view_only=219cc3ad034542f6bd4271457f87ef1f ). This study was funded in part by a Doctoral Dissertation Research Improvement Grant from the National Science Foundation (Award #2333553) and a Small Undergraduate Research Grant from Carnegie Mellon University. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of our grantors. We declare no conflicts of interest. We acknowledge the support of the research assistants, friends, and family members who made this research possible.
