JAMA Netw Open. 2024 May 1;7(5):e248895. doi: 10.1001/jamanetworkopen.2024.8895.

Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department

Christopher Y K Williams et al. JAMA Netw Open. 2024.

Abstract

Importance: The introduction of large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4; OpenAI), has generated significant interest in health care, yet studies evaluating their performance in a clinical setting are lacking. Determination of clinical acuity, a measure of a patient's illness severity and level of required medical attention, is one of the foundational elements of medical reasoning in emergency medicine.

Objective: To determine whether an LLM can accurately assess clinical acuity in the emergency department (ED).

Design, setting, and participants: This cross-sectional study identified all adult ED visits from January 1, 2012, to January 17, 2023, at the University of California, San Francisco, with a documented Emergency Severity Index (ESI) acuity level (immediate, emergent, urgent, less urgent, or nonurgent) and with a corresponding ED physician note. A sample of 10 000 pairs of ED visits with nonequivalent ESI scores, balanced for each of the 10 possible pairs of 5 ESI scores, was selected at random.
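To make the sampling design concrete, below is a minimal sketch of the balanced pair-sampling step, assuming visits are available as (visit_id, esi_level) records. The record schema, the `sample_balanced_pairs` helper, and the sampling-with-replacement choice are illustrative assumptions; the abstract specifies only the balanced 10 000-pair total across the 10 combinations of nonequivalent ESI levels.

```python
import random
from itertools import combinations

ESI_LEVELS = ["immediate", "emergent", "urgent", "less urgent", "nonurgent"]

def sample_balanced_pairs(visits, pairs_per_combo=1000, seed=0):
    """Draw an equal number of visit pairs for each of the 10
    unordered combinations of distinct ESI acuity levels.

    `visits` is assumed to be an iterable of (visit_id, esi_level)
    tuples; the study's actual data schema is not reported in the
    abstract. Sampling with replacement is a simplifying assumption.
    """
    rng = random.Random(seed)
    by_level = {level: [] for level in ESI_LEVELS}
    for visit_id, level in visits:
        by_level[level].append(visit_id)

    sampled = []
    for level_a, level_b in combinations(ESI_LEVELS, 2):  # 10 combinations
        for _ in range(pairs_per_combo):
            sampled.append((rng.choice(by_level[level_a]),
                            rng.choice(by_level[level_b])))
    rng.shuffle(sampled)
    return sampled  # 10 combinations x 1000 = 10 000 pairs
```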

Exposure: Classification by the LLM of patients' acuity levels in the ED, based on the ESI, across 10 000 patient pairs. Using deidentified clinical text, the LLM was queried to identify the patient with the higher-acuity presentation within each pair based on the patients' clinical histories. An earlier LLM was queried on the same pairs to allow comparison with this model.
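A sketch of the pairwise query described above follows. The prompt wording, model name, and response handling are hypothetical: the abstract does not report the study's actual prompt or model parameters. The example uses the OpenAI Python client's chat-completions call as an assumed interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the study's actual wording is not reported here.
PROMPT_TEMPLATE = (
    "You are an emergency medicine triage assistant. Below are the "
    "deidentified presenting histories of two emergency department "
    "patients.\n\nPatient A:\n{history_a}\n\nPatient B:\n{history_b}\n\n"
    "Which patient has the higher-acuity presentation? "
    "Answer with exactly 'A' or 'B'."
)

def compare_acuity(history_a: str, history_b: str, model: str = "gpt-4") -> str:
    """Ask the LLM which of two presenting histories is higher acuity."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for evaluation
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(history_a=history_a,
                                                     history_b=history_b)}],
    )
    return response.choices[0].message.content.strip()
```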

Main outcomes and measures: Accuracy score was calculated to evaluate the performance of both LLMs across the 10 000-pair sample. A 500-pair subsample was manually classified by a physician reviewer to compare performance between the LLMs and human classification.

Results: From a total of 251 401 adult ED visits, a balanced sample of 10 000 patient pairs was created wherein each pair comprised patients with disparate ESI acuity scores. Across this sample, the LLM correctly inferred the patient with higher acuity for 8940 of 10 000 pairs (accuracy, 0.89 [95% CI, 0.89-0.90]). Performance of the comparator LLM (accuracy, 0.84 [95% CI, 0.83-0.84]) was below that of its successor. Among the 500-pair subsample that was also manually classified, LLM performance (accuracy, 0.88 [95% CI, 0.86-0.91]) was comparable with that of the physician reviewer (accuracy, 0.86 [95% CI, 0.83-0.89]).
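As a sanity check on the reported intervals: accuracy is simply the fraction of pairs classified correctly (8940/10 000 = 0.894), and the reported 0.89-0.90 interval is consistent with a standard binomial confidence interval. A minimal sketch using the Wilson score interval, which is an assumption, since the abstract does not state which interval method was used:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score 95% CI for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

acc = 8940 / 10_000                  # 0.894
lo, hi = wilson_ci(8940, 10_000)     # approximately (0.888, 0.900)
print(f"accuracy = {acc:.3f}, 95% CI = ({lo:.2f}, {hi:.2f})")  # 0.89-0.90
```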

Conclusions and relevance: In this cross-sectional study of 10 000 pairs of ED visits, the LLM accurately identified the patient with higher acuity when given pairs of presenting histories extracted from patients' first ED documentation. These findings suggest that the integration of an LLM into ED workflows could enhance triage processes while maintaining triage quality and warrants further investigation.


Conflict of interest statement

Conflict of Interest Disclosures: Ms Miao reported receiving personal fees from SandboxAQ outside the submitted work. Dr Kornblith reported being a cofounder of Capture Diagnostics LLC outside the submitted work. Dr Butte reported being a cofounder of and consulting for Personalis Inc and NuMedii Inc; consulting for Mango Tree Corp, Samsung Electronics Co Ltd, 10x Genomics Inc, Helix Inc, Pathway Genomics, and Verinata Health Inc (Illumina Inc); serving on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis AG, Genentech Inc, Merck & Co Inc, and Roche; being a shareholder of Personalis Inc and NuMedii Inc; being a minor shareholder of Apple Inc, Meta (Facebook), Alphabet Inc (Google), Microsoft Corp, Amazon, Snap Inc, 10x Genomics Inc, Illumina Inc, Regeneron Pharmaceuticals Inc, Sanofi SA, Pfizer Inc, Royalty Pharma PLC, Moderna Inc, Sutro Biopharma Inc, Doximity, BioNTech SA, Invitae Corp, Pacific Biosciences of California Inc, Editas Medicine Inc, Nuna Inc, Assay Depot, Vet24seven Inc, Sophia Genetics, Allbirds Inc, Coursera Plus, DigitalOcean Holdings Inc, Rivian Automotive Inc, Snowflake Inc, Netflix Inc, Starbucks Corp, Advanced Micro Devices Inc, Tesla Inc, Personalis Inc, and Eli Lilly and Co; receiving honoraria and travel reimbursement for invited talks from Johnson & Johnson, Roche, Genentech Inc, Pfizer Inc, Merck & Co Inc, Eli Lilly and Co Inc, Takeda Pharmaceutical Co, Varian Medical Systems, Mars Therapeutics Private Limited, Siemens AG, Optum Inc, Abbott Laboratories, Celgene Corp, AstraZeneca, AbbVie Inc, Westat, Boston Children’s Hospital, The Johns Hopkins University, Endocrine Society, Alliance for Academic Internal Medicine, Children’s Hospital of Philadelphia, University of Pittsburgh Medical Center, Cleveland Clinic, University of Utah, Society of Toxicology, Mayo Clinic, Oracle Cerner, and the Transplantation Society; receiving royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii Inc and Personalis Inc; and receiving research funding from the National Institutes of Health, Peraton Inc, Genentech Inc, Johnson & Johnson, the US Food and Drug Administration, the Robert Wood Johnson Foundation, the Leon Lowenstein Foundation, the Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, the March of Dimes, the Juvenile Diabetes Research Foundation, the California Governor’s Office of Planning and Research, the California Institute for Regenerative Medicine, L’Oréal SA, and Progenity. No other disclosures were reported.

Figures

Figure 1. Flowchart of Included Emergency Department (ED) Visits
ESI indicates Emergency Severity Index (immediate, emergent, urgent, less urgent, and nonurgent). A balanced sample of 10 000 patient pairs was created from the full sample wherein each pair comprised patients with the following disparate ESI acuity scores: immediate/emergent (n = 1000); immediate/urgent (n = 1000); immediate/less urgent (n = 1000); immediate/nonurgent (n = 1000); emergent/urgent (n = 1000); emergent/less urgent (n = 1000); emergent/nonurgent (n = 1000); urgent/less urgent (n = 1000); urgent/nonurgent (n = 1000); less urgent/nonurgent (n = 1000).
Figure 2. Comparison of Large Language Model (LLM) and Physician Performance
Evaluated for each type of Emergency Severity Index (ESI) acuity level pairing in the 500-pair subsample (immediate, emergent, urgent, less urgent, and nonurgent). Overall LLM accuracy was 0.88 (95% CI, 0.86-0.91); overall physician accuracy, 0.86 (95% CI, 0.83-0.89). Error bars indicate 95% CIs.
Figure 3. Comparison of Comparator Large Language Model (LLM) and Physician Performance
Evaluated for each type of Emergency Severity Index (ESI) acuity level pairing in the 500-pair subsample (immediate, emergent, urgent, less urgent, and nonurgent). Overall comparator LLM accuracy was 0.84 (95% CI, 0.81-0.88); overall physician accuracy, 0.86 (95% CI, 0.83-0.89). Error bars indicate 95% CIs.
