[Preprint]. 2024 Aug 7:2024.08.05.24311485.
doi: 10.1101/2024.08.05.24311485.

Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial

Ethan Goh et al. medRxiv.

Abstract

Importance: Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning with no clear right answers is unknown.

Objective: To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources.

Design: Prospective, randomized controlled trial conducted from 30 November 2023 to 21 April 2024.

Setting: Multi-institutional study from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia involving physicians from across the United States.

Participants: 92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine.

Intervention: Five expert-developed clinical case vignettes were presented with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google), or conventional resources alone.

Main outcomes and measures: The primary outcome was difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.

Results: Physicians using the LLM scored higher than those using conventional resources (mean difference 6.5%, 95% CI 2.7-10.2, p<0.001). Significant improvements were seen in the management decisions (6.1%, 95% CI 2.5-9.7, p=0.001), diagnostic decisions (12.1%, 95% CI 3.1-21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4-9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4-221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8).
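As an aside on how a between-group mean difference with a 95% confidence interval of this kind can be computed, here is a minimal illustrative sketch. The scores below are synthetic (not the trial data), and the normal-approximation interval shown is a simplification; the study's own analysis may have used a t-based or regression approach.

```python
# Illustrative only: synthetic 0-100 scores, NOT the trial data.
# Difference in means with an approximate 95% CI (normal approximation).
from statistics import mean, stdev, NormalDist

llm_scores = [62, 70, 68, 75, 66, 71, 69, 73]    # hypothetical LLM group
conv_scores = [60, 63, 65, 58, 64, 61, 66, 62]   # hypothetical control group

diff = mean(llm_scores) - mean(conv_scores)
# Standard error of the difference between two independent means
se = (stdev(llm_scores) ** 2 / len(llm_scores)
      + stdev(conv_scores) ** 2 / len(conv_scores)) ** 0.5
z = NormalDist().inv_cdf(0.975)                  # ~1.96 for a 95% interval
ci_low, ci_high = diff - z * se, diff + z * se
print(f"mean difference {diff:.1f}, 95% CI ({ci_low:.1f}, {ci_high:.1f})")
```

With real trial data and small samples, a Welch t-interval (t critical value with approximate degrees of freedom) would be the more appropriate choice than the normal approximation used here.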

Conclusions and relevance: LLM assistance improved physician management reasoning compared to conventional resources, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases.

Trial registration: ClinicalTrials.gov Identifier: NCT06208423; https://classic.clinicaltrials.gov/ct2/show/NCT06208423.


Figures

Figure 1: Study Flow Diagram
92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine. Five expert-developed cases were presented, with scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google), or conventional resources alone. The primary outcome was difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.

Figure 2: Comparisons of the Primary Outcome by Physicians with LLM and with Conventional Resources Only (total score standardized to 0-100)

