Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication

Philip Chung et al. JAMA Surg. 2024 Aug 1;159(8):928-937. doi: 10.1001/jamasurg.2024.1621.

Abstract

Importance: General-domain large language models may be able to perform risk stratification and predict postoperative outcome measures using a description of the procedure and a patient's electronic health record notes.

Objective: To examine predictive performance on 8 different tasks: prediction of American Society of Anesthesiologists Physical Status (ASA-PS), hospital admission, intensive care unit (ICU) admission, unplanned admission, hospital mortality, postanesthesia care unit (PACU) phase 1 duration, hospital duration, and ICU duration.

Design, setting, and participants: This prognostic study included task-specific datasets constructed from 2 years of retrospective electronic health records data collected during routine clinical care. Case and note data were formatted into prompts and given to the large language model GPT-4 Turbo (OpenAI) to generate a prediction and explanation. The setting included a quaternary care center comprising 3 academic hospitals and affiliated clinics in a single metropolitan area. Patients who had a surgery or procedure with anesthesia and at least 1 clinician-written note filed in the electronic health record before surgery were included in the study. Data were analyzed from November to December 2023.
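As a concrete illustration of this pipeline, here is a minimal sketch in Python using the OpenAI client; the prompt wording, the predict_asa_ps helper, and the model snapshot are illustrative assumptions, not the study's exact template.

```python
# A minimal sketch of the prompt-and-predict step described above, using the
# OpenAI Python client. Prompt wording, model snapshot, and the helper name
# are illustrative assumptions, not the study's exact template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_asa_ps(procedure: str, notes: str) -> str:
    """Ask the LLM for an ASA-PS prediction plus a free-text explanation."""
    prompt = (
        "You are an anesthesiologist assessing perioperative risk.\n"
        f"Procedure: {procedure}\n"
        f"Preoperative notes: {notes}\n"
        "Predict the ASA Physical Status (I-VI) and explain your reasoning."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed snapshot; the study used GPT-4 Turbo
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic output for evaluation
    )
    return response.choices[0].message.content
```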

Exposures: Four prompting strategies were compared: original notes, note summaries, few-shot prompting, and chain-of-thought prompting.

Main outcomes and measures: F1 score for binary and categorical outcomes. Mean absolute error for numerical duration outcomes.
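For reference, both metrics can be computed with scikit-learn as in the sketch below; the macro averaging shown for the multiclass ASA-PS task is an assumption, as the abstract does not state which F1 average was used.

```python
# Sketch of the two reported metrics using scikit-learn. The macro average
# for the multiclass ASA-PS task is an assumption; the data are toy values.
from sklearn.metrics import f1_score, mean_absolute_error

y_true_cls = ["II", "III", "III", "IV"]   # ground-truth ASA-PS labels
y_pred_cls = ["II", "III", "II", "IV"]    # LLM-predicted labels
print(f1_score(y_true_cls, y_pred_cls, average="macro"))

y_true_min = [35, 60, 120]                # observed PACU phase 1 minutes
y_pred_min = [40, 45, 150]                # LLM-predicted minutes
print(mean_absolute_error(y_true_min, y_pred_min))
```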

Results: Study results were measured on task-specific datasets of 1000 cases each, except unplanned admission (949 cases) and hospital mortality (576 cases). The best results for each task included an F1 score of 0.50 (95% CI, 0.47-0.53) for ASA-PS, 0.64 (95% CI, 0.61-0.67) for hospital admission, 0.81 (95% CI, 0.78-0.83) for ICU admission, 0.61 (95% CI, 0.58-0.64) for unplanned admission, and 0.86 (95% CI, 0.83-0.89) for hospital mortality prediction. Performance on duration prediction tasks was poor across all prompt strategies: the large language model achieved a mean absolute error of 49 minutes (95% CI, 46-51 minutes) for PACU phase 1 duration, 4.5 days (95% CI, 4.2-5.0 days) for hospital duration, and 1.1 days (95% CI, 0.9-1.3 days) for ICU duration.

Conclusions and relevance: Current general-domain large language models may assist clinicians in perioperative risk stratification on classification tasks but are inadequate for numerical duration predictions. Their ability to produce high-quality natural language explanations for the predictions may make them useful tools in clinical workflows and may be complementary to traditional risk prediction models.


Conflict of interest statement

Conflict of Interest Disclosures: Dr Walters reported receiving consulting fees from Sonosite and Philips outside the submitted work. Dr O’Reilly-Shah reported being an equity holder of Doximity Inc outside the submitted work. No other disclosures were reported.

Figures

Figure 1. Overview of Experimental Apparatus
Overview of the experimental apparatus. Each task-specific dataset is divided into an inference dataset of query cases and a few-shot dataset used to construct few-shot prompts in an 80%-20% split. GPT-4 Turbo (OpenAI) is used as the large language model (LLM) in all steps. Each prompt to the LLM is unique based on the task, prompt strategy, and query case for which an answer and explanation are generated. Unplanned admission refers to patients who were planned for outpatient surgery but were actually admitted postoperatively. Hospital mortality refers to postoperative in-hospital mortality and not 30-day mortality. Zero-shot prompt strategy is conducted with both original clinical notes and a summary of the clinical notes. Few-shot prompts include example demonstrations from the few-shot dataset. Each few-shot demonstration is a question, procedure description, summary of patient notes, and answer. Summaries are generated using the LLM. The few-shot chain-of-thought (CoT) prompt strategy requires a CoT rationale for each few-shot demonstration that links the question to the answer, which is also generated using the LLM. Few-shot demonstrations are dynamically selected for each query case with inverse frequency sampling to balance the distribution of answers of few-shot demonstrations. Answers provided by the LLM are compared against the ground-truth label derived from electronic health record (EHR) data, and either an F1 score or mean absolute error is computed, depending on whether the outcome variable for the task is categorical/binary or numerical. ASA-PS indicates American Society of Anesthesiologists Physical Status; ICU, intensive care unit; PACU, postanesthesia care unit.
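The inverse frequency sampling step described in this caption can be sketched as follows; the "answer" field name and the sampling-without-replacement loop are illustrative assumptions about the implementation.

```python
# Sketch of inverse frequency sampling for few-shot demonstration selection:
# demonstrations are drawn with probability inversely proportional to how
# common their answer label is, balancing the answer distribution.
import random
from collections import Counter

def sample_demonstrations(pool: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw k demonstrations from the few-shot pool, upweighting rare answers."""
    rng = random.Random(seed)
    counts = Counter(ex["answer"] for ex in pool)        # label frequencies
    pool = list(pool)
    weights = [1.0 / counts[ex["answer"]] for ex in pool]
    chosen = []
    # Sample without replacement so each demonstration appears at most once.
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        weights.pop(idx)
    return chosen
```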
Figure 2. Annotated Example Prompt and Large Language Model (LLM) Response
A prompt and LLM output for the zero-shot chain-of-thought (CoT) question and answer (Q&A) from notes summary prompt strategy. A, Prompt with highlights to show how data from cases and notes are inserted into the prompt template. Black text is the template for the specific prompt strategy, which is the same for all tasks except that the expected variable type is substituted in the response specification (eg, a different answer placeholder is used for binary prediction tasks). Procedure information is inserted into the prompt template without modification. Note summaries are generated from clinical notes in a separate step using the LLM, and the summary is then inserted into the prompt template. B, The LLM output explanation shows that the LLM understands the definition of American Society of Anesthesiologists Physical Status (ASA-PS) classification and provides a valid rationale for classifying the patient’s ASA-PS. Because LLMs are left-to-right causal language models, CoT prompt strategies always request generation of the step-by-step explanation before the final answer to ensure the LLM considers the explanation when generating the final answer. Although the content of this example is derived from a real patient and case from the electronic health record, all protected health information and personally identifiable information are removed, with names obfuscated and dates and times shifted. More detail on all prompt strategies used in the experiments, including the prompts used to generate summaries and CoT rationales, is depicted in eFigure 2 in Supplement 1. A-Fib indicates atrial fibrillation; HCC, Hierarchical Condition Category; JSON, JavaScript Object Notation; LVEF, left ventricular ejection fraction.
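The explanation-before-answer ordering can be enforced through the response specification, as in this sketch; the JSON key names are assumptions rather than the study's exact schema.

```python
# Sketch of the ordering constraint described above: because generation is
# left to right, the response specification requests the step-by-step
# explanation before the final answer, so the answer is conditioned on it.
# The JSON key names are illustrative assumptions.
import json

RESPONSE_SPEC = (
    "Respond only with a JSON object of the form "
    '{"explanation": "<step-by-step reasoning>", "answer": "<ASA-PS class I-VI>"} '
    "and produce the explanation before the answer."
)

def parse_llm_output(raw: str) -> tuple[str, str]:
    """Extract the explanation and final answer from the LLM's JSON output."""
    parsed = json.loads(raw)
    return parsed["explanation"], parsed["answer"]
```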
Figure 3. Association of Prompt Strategy With Binary and Categorical Prediction Tasks
The large language model (LLM) GPT-4 Turbo (OpenAI) prediction performance on the 5 binary and categorical prediction tasks. The x-axis shows the different prompt strategies, with the first 6 prompt strategies without chain-of-thought (CoT) reasoning and the second 6 with CoT reasoning. “N” indicates that original clinical notes were inserted into the prompt. For the remaining prompt strategies, clinical notes were first summarized using the LLM and the summary was inserted into the prompt. “0” corresponds to 0-shot, indicating that a zero-shot prompt strategy was used; “5,” “10,” “20,” and “50” correspond to few-shot prompts with 5, 10, 20, and 50 few-shot demonstrations, respectively. All few-shot prompts used note summaries for both the few-shot demonstrations and the query case. The y-axis is the F1 score for classification tasks, where a higher score is better. The baseline differs for each task and represents the score achieved by random guessing. The clinical notes are stratified into short, medium, and long length groups, which represent the 1/3 shortest, 1/3 middle, and 1/3 longest notes by token count (word subunits used by LLMs; 1 token approximately equals 3/4 of a word), and performance is shown for each stratum. CIs are omitted for legibility but are available in eTables 4 to 21 in Supplement 2. ASA-PS indicates American Society of Anesthesiologists Physical Status; ICU, intensive care unit; PACU, postanesthesia care unit.
Figure 4. Effect of Prompt Strategy on Numerical Prediction Tasks
The large language model (LLM) GPT-4 Turbo (OpenAI) prediction performance on the 3 numerical prediction tasks. The x-axis shows the different prompt strategies, with the first 6 prompt strategies without chain-of-thought (CoT) reasoning and the second 6 with CoT reasoning. “N” indicates that original clinical notes were inserted into the prompt. For the remaining prompt strategies, clinical notes were first summarized using the LLM and the summary was inserted into the prompt. “0” corresponds to 0-shot, indicating that a zero-shot prompt strategy was used; “5,” “10,” “20,” and “50” correspond to few-shot prompts with 5, 10, 20, and 50 few-shot demonstrations, respectively. All few-shot prompts used note summaries for both the few-shot demonstrations and the query case. The y-axis is the mean absolute error (MAE) for numerical prediction tasks, where lower error is better. The baseline for numerical prediction tasks represents the MAE achieved by a dummy regressor that always predicts the mean outcome value in the dataset. The clinical notes are stratified into short, medium, and long length groups, which represent the 1/3 shortest, 1/3 middle, and 1/3 longest notes by token count (word subunits used by LLMs; 1 token approximately equals 3/4 of a word), and performance is shown for each stratum. CIs are omitted for legibility but are available in eTables 22 to 24 in Supplement 2. ICU indicates intensive care unit; PACU, postanesthesia care unit.
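The note-length stratification used in Figures 3 and 4 can be sketched with the tiktoken library; treating cl100k_base as the relevant encoding for GPT-4 Turbo is an assumption consistent with GPT-4-family models.

```python
# Sketch of the note-length stratification: notes are ranked by token count
# and split into the shortest, middle, and longest thirds. The cl100k_base
# encoding is assumed as the GPT-4-family tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def stratify_by_length(notes: list[str]) -> dict[str, list[str]]:
    """Split notes into short/medium/long thirds by token count."""
    ranked = sorted(notes, key=lambda n: len(enc.encode(n)))
    third = len(ranked) // 3
    return {
        "short": ranked[:third],
        "medium": ranked[third:2 * third],
        "long": ranked[2 * third:],
    }
```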
