medRxiv [Preprint]. 2024 Mar 14:2024.03.12.24303785. doi: 10.1101/2024.03.12.24303785

Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study

Ethan Goh et al. medRxiv. 2024.


Abstract

Importance: Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promising performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning.

Objective: To assess the impact of the GPT-4 LLM on physicians' diagnostic reasoning compared to conventional resources.

Design: Multi-center, randomized clinical vignette study.

Setting: The study was conducted via remote video conferencing with physicians across the country and through in-person participation at multiple academic medical institutions.

Participants: Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine.

Interventions: Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to conventional resources alone. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams.

Main outcomes and measures: The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis accuracy.

Results: Fifty physicians (26 attendings, 24 residents) participated, completing an average of 5.2 cases each. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared with 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29; p=0.03) higher than the conventional resources group.
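
The abstract does not describe how the adjusted between-group difference was estimated; the minimal sketch below shows one plausible approach, assuming a long-format table of per-case scores and a random-intercept linear mixed model to account for each physician completing up to six cases. The file name, column names (participant, group, score), and the use of statsmodels are illustrative assumptions, not the authors' reported analysis.

    # Illustrative sketch only: estimate the adjusted difference in per-case
    # diagnostic reasoning scores between groups, with a random intercept per
    # physician to account for repeated cases. Column names, file name, and
    # the model choice are assumptions, not the study's actual analysis code.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format data: one row per physician-case, with columns
    # participant (ID), group ("gpt4" or "conventional"), score (0 to 100).
    df = pd.read_csv("case_scores.csv")

    # Random-intercept linear mixed model; the group coefficient is the
    # adjusted difference in percentage points, with its 95% CI from conf_int().
    model = smf.mixedlm("score ~ group", data=df, groups=df["participant"])
    result = model.fit()
    print(result.summary())
    print(result.conf_int())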

Conclusions and relevance: In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.


Figures

Figure 1:
Fifty physicians were randomized to complete a diagnostic quiz with GPT-4 or with conventional resources. Participants were asked to provide a differential diagnosis with statements of findings in favor of or against each diagnosis and to propose the best next diagnostic evaluation steps.

