Evaluating the use of large language models to provide clinical recommendations in the Emergency Department

Christopher Y K Williams et al. Nat Commun. 2024 Oct 8;15(1):8236. doi: 10.1038/s41467-024-52415-1.
Abstract

The release of GPT-4 and other large language models (LLMs) has the potential to transform healthcare. However, existing research evaluating LLM performance on real-world clinical notes is limited. Here, we conduct a highly powered study to determine whether LLMs can provide clinical recommendations for three tasks (admission status, radiological investigation(s) request status, and antibiotic prescription status) using clinical notes from the Emergency Department. We randomly selected 10,000 Emergency Department visits to evaluate the accuracy of zero-shot, GPT-3.5-turbo- and GPT-4-turbo-generated clinical recommendations across four different prompting strategies. We found that both GPT-4-turbo and GPT-3.5-turbo performed poorly compared to a resident physician, with accuracy scores on average 8% and 24% lower than the physician's, respectively. Both LLMs tended to be overly cautious in their recommendations, with high sensitivity at the cost of specificity. Our findings demonstrate that, while early evaluations of the clinical use of LLMs are promising, LLM performance must be significantly improved before LLMs can be deployed as decision support systems for clinical recommendations and other complex tasks.
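The sensitivity/specificity trade-off described above can be made concrete with a small sketch. This is illustrative only (not the study's actual evaluation code): it shows how an overly cautious model that recommends the positive action (e.g., "admit") for nearly every visit scores high on sensitivity but low on specificity, even as overall accuracy suffers.

```python
# Illustrative sketch of the binary-classification metrics reported in the
# study: accuracy, sensitivity (recall for the positive class), and
# specificity (recall for the negative class). Labels: 1 = positive
# recommendation (e.g., admit), 0 = negative (e.g., discharge).

def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity

# A toy "overly cautious" model: it predicts the positive action for five
# of six visits, catching every true positive (sensitivity = 1.0) but
# generating many false positives (specificity = 0.25, accuracy = 0.5).
y_true = [1, 1, 0, 0, 0, 0]
cautious_pred = [1, 1, 1, 1, 1, 0]
acc, sens, spec = binary_metrics(y_true, cautious_pred)
```

In clinical terms, this failure mode means unnecessary admissions, imaging, or antibiotic prescriptions, which is why the abstract flags high sensitivity "at the cost of specificity" as a limitation rather than a strength.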


Conflict of interest statement

C.Y.K.W. has no conflicts of interest to disclose. B.Y.M. is an employee of SandboxAQ. A.E.K. is a co-founder and consultant to CaptureDx. A.J.B. is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. A.J.B. receives royalty payments through Stanford University, for several patents and other disclosures licensed to NuMedii and Personalis. A.J.B.'s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor's Office of Planning and Research, California Institute for Regenerative Medicine, L'Oreal, and Progenity. None of these entities had any bearing on the design of this study or the writing of the manuscript.

Figures

Fig. 1
Fig. 1. Patient flowchart.
Flowchart of included Emergency Department visits and construction of both balanced (n = 10,000 samples) and unbalanced (n = 1000 samples reflecting the real-world distribution of patients presenting to the Emergency Department) datasets for the following outcomes: (1) admission status, (2) radiological investigation(s) status, and (3) antibiotic prescription status.
Fig. 2
Fig. 2. LLM performance: unbalanced n = 1000 sample.
Evaluation of physician and (A) GPT-3.5-turbo or (B) GPT-4-turbo accuracy across four iterations of prompt engineering (Prompts A-D), evaluated on an unbalanced n = 1000 sample reflective of the real-world distribution of clinical recommendations among patients presenting to the ED, for the following three clinical recommendation tasks: (1) Should the patient be admitted to hospital; (2) Does the patient require radiological investigation; and (3) Does the patient require antibiotics. Source data are provided as a Source Data file.

