Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication
- PMID: 38837145
- PMCID: PMC11154375
- DOI: 10.1001/jamasurg.2024.1621
Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication
Abstract
Importance: General-domain large language models may be able to perform risk stratification and predict postoperative outcome measures using a description of the procedure and a patient's electronic health record notes.
Objective: To examine predictive performance on 8 different tasks: prediction of American Society of Anesthesiologists Physical Status (ASA-PS), hospital admission, intensive care unit (ICU) admission, unplanned admission, hospital mortality, postanesthesia care unit (PACU) phase 1 duration, hospital duration, and ICU duration.
Design, setting, and participants: This prognostic study included task-specific datasets constructed from 2 years of retrospective electronic health records data collected during routine clinical care. Case and note data were formatted into prompts and given to the large language model GPT-4 Turbo (OpenAI) to generate a prediction and explanation. The setting included a quaternary care center comprising 3 academic hospitals and affiliated clinics in a single metropolitan area. Patients who had a surgery or procedure with anesthesia and at least 1 clinician-written note filed in the electronic health record before surgery were included in the study. Data were analyzed from November to December 2023.
Exposures: Compared original notes, note summaries, few-shot prompting, and chain-of-thought prompting strategies.
Main outcomes and measures: F1 score for binary and categorical outcomes. Mean absolute error for numerical duration outcomes.
Results: Study results were measured on task-specific datasets, each with 1000 cases with the exception of unplanned admission, which had 949 cases, and hospital mortality, which had 576 cases. The best results for each task included an F1 score of 0.50 (95% CI, 0.47-0.53) for ASA-PS, 0.64 (95% CI, 0.61-0.67) for hospital admission, 0.81 (95% CI, 0.78-0.83) for ICU admission, 0.61 (95% CI, 0.58-0.64) for unplanned admission, and 0.86 (95% CI, 0.83-0.89) for hospital mortality prediction. Performance on duration prediction tasks was universally poor across all prompt strategies for which the large language model achieved a mean absolute error of 49 minutes (95% CI, 46-51 minutes) for PACU phase 1 duration, 4.5 days (95% CI, 4.2-5.0 days) for hospital duration, and 1.1 days (95% CI, 0.9-1.3 days) for ICU duration prediction.
Conclusions and relevance: Current general-domain large language models may assist clinicians in perioperative risk stratification on classification tasks but are inadequate for numerical duration predictions. Their ability to produce high-quality natural language explanations for the predictions may make them useful tools in clinical workflows and may be complementary to traditional risk prediction models.
Conflict of interest statement
Figures




Comment on
-
Travel Guide From the Brave New World of Artificial Intelligence.JAMA Surg. 2024 Aug 1;159(8):937-938. doi: 10.1001/jamasurg.2024.1645. JAMA Surg. 2024. PMID: 38837154 No abstract available.
References
-
- Brown TB, Mann B, Ryder N, et al. . Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20. Curran Associates Inc; 2020:1877-1901.
-
- Ouyang L, Wu J, Jiang X, et al. . Training language models to follow instructions with human feedback. arXiv [csCL]. Published online March 4, 2022. http://arxiv.org/abs/2203.02155
-
- Zhang X, Tian C, Yang X, Chen L, Li Z, Petzold LR. AlpaCare:instruction-tuned large language models for medical application. arXiv [csCL]. Published online October 23, 2023. http://arxiv.org/abs/2310.14558
-
- Taori R, Gulrajani I, Zhang T, et al. . Stanford alpaca: an instruction-following llama model. Accessed November 28, 2023. https://crfm.stanford.edu/2023/03/13/alpaca.html
-
- Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2022:1998-2022. doi:10.18653/v1/2022.emnlp-main.130 - DOI
Publication types
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources