Evaluating LLM-based generative AI tools in emergency triage: A comparative study of ChatGPT Plus, Copilot Pro, and triage nurses
- PMID: 39731895
- DOI: 10.1016/j.ajem.2024.12.024
Evaluating LLM-based generative AI tools in emergency triage: A comparative study of ChatGPT Plus, Copilot Pro, and triage nurses
Abstract
Background: The number of emergency department (ED) visits has been on steady increase globally. Artificial Intelligence (AI) technologies, including Large Language Model (LLMs)-based generative AI models, have shown promise in improving triage accuracy. This study evaluates the performance of ChatGPT and Copilot in triage at a high-volume urban hospital, hypothesizing that these tools can match trained physicians' accuracy and reduce human bias amidst ED crowding challenges.
Methods: This single-center, prospective observational study was conducted in an urban ED over one week. Adult patients were enrolled through random 24-h intervals. Exclusions included minors, trauma cases, and incomplete data. Triage nurses assessed patients while an emergency medicine (EM) physician documented clinical vignettes and assigned emergency severity index (ESI) levels. These vignettes were then introduced to ChatGPT and Copilot for comparison with the triage nurse's decision.
Results: The overall triage accuracy was 65.2 % for nurses, 66.5 % for ChatGPT, and 61.8 % for Copilot, with no significant difference (p = 0.000). Moderate agreement was observed between the EM physician and ChatGPT, triage nurses, and Copilot (Cohen's Kappa = 0.537, 0.477, and 0.472, respectively). In recognizing high-acuity patients, ChatGPT and Copilot outperformed triage nurses (87.8 % and 85.7 % versus 32.7 %, respectively). Compared to ChatGPT and Copilot, nurses significantly under-triaged patients (p < 0.05). The analysis of predictive performance for ChatGPT, Copilot, and triage nurses demonstrated varying discrimination abilities across ESI levels, all of which were statistically significant (p < 0.05). ChatGPT and Copilot exhibited consistent accuracy across age, gender, and admission time, whereas triage nurses were more likely to mistriage patients under 45 years old.
Conclusion: ChatGPT and Copilot outperform traditional nurse triage in identifying high-acuity patients, but real-time ED capacity data is crucial to prevent overcrowding and ensure high-quality of emergency care.
Keywords: ChatGPT; Copilot; Emergency medicine; Emergency severity index; Generative artificial intelligence; Large language models; Triage.
Copyright © 2024 Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of competing interest The authors declare that there is no conflict of interest.
Publication types
MeSH terms
LinkOut - more resources
Full Text Sources
