[Preprint]. 2023 Mar 29:2023.03.24.23287731. doi: 10.1101/2023.03.24.23287731.

Performance of ChatGPT on free-response, clinical reasoning exams


Eric Strong et al. medRxiv.


Abstract

Importance: Studies show that ChatGPT, a general-purpose large language model chatbot, can pass the multiple-choice US Medical Licensing Exams, but the model's performance on open-ended clinical reasoning questions is unknown.

Objective: To determine if ChatGPT is capable of consistently meeting the passing threshold on free-response, case-based clinical reasoning assessments.

Design: Fourteen multi-part cases were selected from clinical reasoning exams administered to pre-clerkship medical students between 2019 and 2022. For each case, the questions were run through ChatGPT twice and the responses were recorded. Two clinician educators independently graded each run according to a standardized grading rubric. To further assess the degree of variation in ChatGPT's performance, we repeated the analysis on a single high-complexity case 20 times.

Setting: A single US medical school.

Participants: ChatGPT.

Main outcomes and measures: Passing rate of ChatGPT's scored responses and the range in model performance across multiple run-throughs of a single case.

Results: Twelve of the 28 ChatGPT exam responses (43%) achieved a passing score, with a mean score of 69% (95% CI: 65% to 73%) against the established passing threshold of 70%. When given the same case 20 separate times, ChatGPT's performance varied, with scores ranging from 56% to 81%.

Conclusions and relevance: ChatGPT achieved a passing performance in nearly half of the cases analyzed, demonstrating the need to revise clinical reasoning assessments and to incorporate artificial intelligence (AI)-related topics into medical curricula and practice.
