Randomized Controlled Trial

JAMA Netw Open. 2024 Oct 1;7(10):e2440969.

doi: 10.1001/jamanetworkopen.2024.40969.

Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial

Ethan Goh¹,², Robert Gallo³, Jason Hom⁴, Eric Strong⁴, Yingjie Weng⁵, Hannah Kerman⁶,⁷, Joséphine A Cool⁶,⁷, Zahir Kanjee⁶,⁷, Andrew S Parsons⁸, Neera Ahuja⁴, Eric Horvitz⁹,¹⁰, Daniel Yang¹¹, Arnold Milstein², Andrew P J Olson¹², Adam Rodman⁶,⁷, Jonathan H Chen¹,²,¹³

Affiliations

¹ Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California.
² Stanford Clinical Excellence Research Center, Stanford University, Stanford, California.
³ Center for Innovation to Implementation, VA Palo Alto Health Care System, Palo Alto, California.
⁴ Department of Hospital Medicine, Stanford University School of Medicine, Stanford, California.
⁵ Quantitative Sciences Unit, Stanford University School of Medicine, Stanford, California.
⁶ Department of Hospital Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts.
⁷ Department of Hospital Medicine, Harvard Medical School, Boston, Massachusetts.
⁸ Department of Hospital Medicine, School of Medicine, University of Virginia, Charlottesville.
⁹ Microsoft Corp, Redmond, Washington.
¹⁰ Stanford Institute for Human-Centered Artificial Intelligence, Stanford, California.
¹¹ Department of Hospital Medicine, Kaiser Permanente, Oakland, California.
¹² Department of Hospital Medicine, University of Minnesota Medical School, Minneapolis.
¹³ Division of Hospital Medicine, Stanford University, Stanford, California.

PMID: 39466245
PMCID: PMC11519755
DOI: 10.1001/jamanetworkopen.2024.40969

Ethan Goh et al. JAMA Netw Open. 2024.

Abstract

Importance: Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves physician diagnostic reasoning.

Objective: To assess the effect of an LLM on physicians' diagnostic reasoning compared with conventional resources.

Design, setting, and participants: A single-blind randomized clinical trial was conducted from November 29 to December 29, 2023. Using remote video conferencing and in-person participation across multiple academic medical institutions, physicians with training in family medicine, internal medicine, or emergency medicine were recruited.

Intervention: Participants were randomized to either access the LLM in addition to conventional diagnostic resources or conventional resources only, stratified by career stage. Participants were allocated 60 minutes to review up to 6 clinical vignettes.

Main outcomes and measures: The primary outcome was performance on a standardized rubric of diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps, validated and graded via blinded expert consensus. Secondary outcomes included time spent per case (in seconds) and final diagnosis accuracy. All analyses followed the intention-to-treat principle. A secondary exploratory analysis evaluated the standalone performance of the LLM by comparing the primary outcomes between the LLM alone group and the conventional resource group.

Results: Fifty physicians (26 attendings, 24 residents; median years in practice, 3 [IQR, 2-8]) participated virtually as well as at 1 in-person site. The median diagnostic reasoning score per case was 76% (IQR, 66%-87%) for the LLM group and 74% (IQR, 63%-84%) for the conventional resources-only group, with an adjusted difference of 2 percentage points (95% CI, -4 to 8 percentage points; P = .60). The median time spent per case for the LLM group was 519 (IQR, 371-668) seconds, compared with 565 (IQR, 456-788) seconds for the conventional resources group, with a time difference of -82 (95% CI, -195 to 31; P = .20) seconds. The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group.

Conclusions and relevance: In this trial, the availability of an LLM to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources. The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice.

Trial registration: ClinicalTrials.gov Identifier: NCT06157944.

Conflict of interest statement

Conflict of Interest Disclosures: Dr Kanjee reported book royalties from and paid membership on the Wolters Kluwer advisory board for medical education products, and personal fees from Oakstone Publishing for continuing medical education lectures on evidence-based medicine outside the submitted work. Dr Parsons reported receiving grants from the American Medical Association and the Southern Group on Educational Affairs outside the submitted work. Dr Yang reported being an employee of the Gordon and Betty Moore Foundation during the conduct of the study. Dr Milstein reported receiving personal fees for advisory board membership from the Peterson Center on Healthcare; holding stock options in Emsana Health, Amino Health, FNF Advisors, JRSL LLC, Embold, EZPT/Somatic Health, and Prealize outside the submitted work; and membership on the Leapfrog Group Board and the Intermountain Healthcare Board. Dr Olson reported receiving grants from 3M and the Agency for Healthcare Research and Quality outside the submitted work. Dr Chen reported receiving grants from the National Institutes of Health (NIH) National Institute of Allergy and Infectious Diseases, NIH National Institute on Drug Abuse Clinical Trials Network, and the American Heart Association; nonfinancial support from Reaction Explorer LLC; personal fees from multiple legal offices as a medicolegal expert witness; grants from Google Inc and Stanford University; and personal fees from ISHI Health Consulting outside the submitted work. No other disclosures were reported.

Update of

Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study.
Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H, Cool J, Kanjee Z, Parsons AS, Ahuja N, Horvitz E, Yang D, Milstein A, Olson APJ, Rodman A, Chen JH. medRxiv [Preprint]. 2024 Mar 14:2024.03.12.24303785. doi: 10.1101/2024.03.12.24303785. Update in: JAMA Netw Open. 2024 Oct 1;7(10):e2440969. doi: 10.1001/jamanetworkopen.2024.40969. PMID: 38559045
Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial.
Goh E, Gallo R, Strong E, Weng Y, Kerman H, Freed J, Cool JA, Kanjee Z, Lane KP, Parsons AS, Ahuja N, Horvitz E, Yang D, Milstein A, Olson APJ, Hom J, Chen JH, Rodman A. medRxiv [Preprint]. 2024 Aug 7:2024.08.05.24311485. doi: 10.1101/2024.08.05.24311485. Update in: JAMA Netw Open. 2024 Oct 1;7(10):e2440969. doi: 10.1001/jamanetworkopen.2024.40969. PMID: 39148822
