Evaluating the use of large language models to provide clinical recommendations in the Emergency Department
- PMID: 39379357
- PMCID: PMC11461960
- DOI: 10.1038/s41467-024-52415-1
Abstract
The release of GPT-4 and other large language models (LLMs) has the potential to transform healthcare. However, existing research evaluating LLM performance on real-world clinical notes is limited. Here, we conduct a highly powered study to determine whether LLMs can provide clinical recommendations for three tasks (admission status, radiological investigation(s) request status, and antibiotic prescription status) using clinical notes from the Emergency Department. We randomly selected 10,000 Emergency Department visits to evaluate the accuracy of zero-shot, GPT-3.5-turbo- and GPT-4-turbo-generated clinical recommendations across four different prompting strategies. We found that both GPT-4-turbo and GPT-3.5-turbo performed poorly compared to a resident physician, with accuracy scores that were on average 8% and 24% lower, respectively, than the physician's. Both LLMs tended to be overly cautious in their recommendations, achieving high sensitivity at the cost of specificity. Our findings demonstrate that, while early evaluations of the clinical use of LLMs are promising, LLM performance must be significantly improved before they are deployed as decision support systems for clinical recommendations and other complex tasks.
© 2024. The Author(s).
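To make the reported trade-off concrete, the following minimal sketch shows how accuracy, sensitivity, and specificity are computed for a binary recommendation such as admission status. The function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are illustrative assumptions and are not taken from the study; the sketch only illustrates how an overly cautious recommender can reach high sensitivity while losing specificity.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float


def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    # Tally the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return BinaryMetrics(
        accuracy=(tp + tn) / len(y_true),
        # Sensitivity: share of true positives that were recommended.
        sensitivity=tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of true negatives that were correctly not recommended.
        specificity=tn / (tn + fp) if (tn + fp) else float("nan"),
    )


# Hypothetical toy labels (1 = admit, 0 = discharge); NOT data from the study.
y_true = [1, 0, 0, 0, 1, 0, 0, 1]
# An "overly cautious" recommender that admits most patients.
y_cautious = [1, 1, 1, 0, 1, 1, 0, 1]

m = binary_metrics(y_true, y_cautious)
print(m)
```

In this toy case every patient who truly needed admission is flagged, so sensitivity is perfect, but three of the five patients who did not need admission are flagged as well, which pulls down both specificity and overall accuracy.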
Conflict of interest statement
C.Y.K.W. has no conflicts of interest to disclose. B.Y.M. is an employee of SandboxAQ. A.E.K. is a co-founder and consultant to CaptureDx. A.J.B. is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, and Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. A.J.B. receives royalty payments through Stanford University, for several patents and other disclosures licensed to NuMedii and Personalis. A.J.B.’s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor’s Office of Planning and Research, California Institute for Regenerative Medicine, L’Oreal, and Progenity. None of these entities had any bearing on the design of this study or the writing of the manuscript.
Comment in
- Beyond the AJR: Toward Large Language Models for Radiology Decision-Making in the Emergency Department. AJR Am J Roentgenol. 2025 Jul;225(1):e2432465. doi: 10.2214/AJR.24.32465. Epub 2025 Jul 16. PMID: 39660831.
