Evaluation and mitigation of the limitations of large language models in clinical decision-making

Paul Hager et al. Nat Med. 2024 Sep;30(9):2613-2622. doi: 10.1038/s41591-024-03097-1. Epub 2024 Jul 4.
Abstract

Clinical decision-making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database, spanning 2,400 real patient cases and four common abdominal pathologies, as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making; we provide a dataset and framework to guide future studies.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of dataset creation and evaluation framework.
a, To properly evaluate LLMs for clinical decision-making in realistic conditions, we created a curated dataset from real-world cases derived from the MIMIC-IV database, which contains comprehensive electronic health record data recorded during hospital admissions. b, Our evaluation framework reflects a realistic clinical setting and thoroughly evaluates LLMs across multiple criteria, including diagnostic accuracy, adherence to diagnostic and treatment guidelines, consistency in following instructions, ability to interpret laboratory results, and robustness to changes in instruction, information quantity and information order. ICD, International Classification of Diseases; CT, computed tomography; US, ultrasound; MRCP, magnetic resonance cholangiopancreatography.
Fig. 2. LLMs diagnose significantly worse than doctors when provided with all information.
On a subset (n = 80) of MIMIC-CDM-FI, we compared the mean diagnostic accuracy of LLMs over multiple seeds (n = 20) with that of clinicians (n = 4) and found that LLMs perform significantly worse on average (P < 0.001), and especially on cholecystitis (P < 0.001) and diverticulitis (P < 0.001). The mean diagnostic accuracy is shown above each bar. Vertical lines indicate the standard deviation. The individual data points are shown.
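As a minimal sketch of the aggregation behind these bars (the paper's own analysis code is not shown here), the per-model mean and standard deviation over seeds could be computed as follows; all accuracy values below are invented for illustration:

```python
import numpy as np

# Illustrative per-seed accuracies for one model (n = 20 seeds) and
# per-clinician accuracies (n = 4); the values here are invented.
llm_acc = np.array([0.62, 0.58, 0.60, 0.61, 0.59, 0.63, 0.60, 0.57, 0.62, 0.61,
                    0.58, 0.60, 0.59, 0.62, 0.61, 0.60, 0.58, 0.63, 0.59, 0.61])
clinician_acc = np.array([0.86, 0.89, 0.91, 0.88])

# The mean is shown above each bar; the sample standard deviation
# (ddof=1) gives the vertical line.
print(f"LLM:        {llm_acc.mean():.3f} +/- {llm_acc.std(ddof=1):.3f}")
print(f"Clinicians: {clinician_acc.mean():.3f} +/- {clinician_acc.std(ddof=1):.3f}")
```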
Fig. 3. Diagnostic accuracy of LLMs decreased in an autonomous clinical decision-making scenario.
When tasked with gathering all information required for clinical decision-making themselves, LLMs perform best when diagnosing appendicitis but perform poorly on the other three pathologies of cholecystitis, diverticulitis and pancreatitis. In such a realistic clinical scenario, model performance decreased compared to the retrospective diagnosis with all information provided (MIMIC-CDM-FI). The exact diagnostic accuracy is shown above each bar.
Fig. 4. LLMs do not consistently recommend essential and patient-specific treatment.
Expected treatments were determined based on clinical guidelines and the actual treatments received by patients in the dataset. Models fail to recommend appropriate treatments, especially for patients with more severe forms of the pathologies. We scored models only on the subset of patients that they correctly diagnosed and that actually received a specific treatment, as in the sketch below. For example, of the 957 patients with appendicitis, 808 received an appendectomy (indicated below the treatment name). Of those 808 patients, Llama 2 Chat correctly diagnosed 603 (indicated below the Llama 2 Chat bar). Of those 603 patients, Llama 2 Chat correctly recommended an appendectomy 97.5% of the time. ERCP, endoscopic retrograde cholangiopancreatography.
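The conditional scoring described above reduces to a simple filter-then-count. Below is a minimal sketch of that logic; the record fields (`true_dx`, `model_dx`, `received`, `model_rec`) are hypothetical names, not the actual MIMIC-CDM schema:

```python
# Hypothetical per-patient records for illustration only.
cases = [
    {"true_dx": "appendicitis", "model_dx": "appendicitis",
     "received": ["appendectomy"], "model_rec": ["appendectomy"]},
    {"true_dx": "appendicitis", "model_dx": "cholecystitis",
     "received": ["appendectomy"], "model_rec": []},
]

def treatment_score(cases, pathology, treatment):
    """Score only patients the model diagnosed correctly AND who
    actually received the treatment, as described in Fig. 4."""
    eligible = [c for c in cases
                if c["true_dx"] == pathology
                and c["model_dx"] == pathology
                and treatment in c["received"]]
    if not eligible:
        return None, 0
    hits = sum(treatment in c["model_rec"] for c in eligible)
    return hits / len(eligible), len(eligible)

score, n = treatment_score(cases, "appendicitis", "appendectomy")
print(f"appendectomy recommended for {score:.1%} of {n} eligible cases")
```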
Fig. 5. LLMs are sensitive to the quantity of information provided.
We compared each model's performance when given all diagnostic information against its performance when given only a single diagnostic exam in addition to the history of present illness. For almost all diseases, providing all information does not lead to the best performance on the MIMIC-CDM-FI dataset. This suggests that LLMs cannot focus on the key facts and that their performance degrades when too much information is provided. This poses a problem in the clinic, where an abundance of information is typically gathered to understand the patient's health holistically, and the ability to focus on key facts is an essential skill. The gray theoretical best line shows the mean performance if a clinician were to select the best diagnostic test for each pathology. HPI, history of present illness.
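As a sketch of what this quantity ablation looks like in practice, the loop below evaluates a model once with the full record and once per single exam paired with the HPI; `diagnose` and the exam keys are hypothetical stand-ins, not the paper's framework API:

```python
# Hypothetical exam categories; the real dataset's fields may differ.
EXAMS = ["physical_examination", "laboratory_tests", "imaging"]

def quantity_ablation(diagnose, case):
    """case: dict with 'hpi', one entry per exam in EXAMS, and 'diagnosis'.
    Returns whether the model was correct under each information setting."""
    results = {}
    full_context = [case["hpi"]] + [case[exam] for exam in EXAMS]
    results["all_information"] = diagnose(full_context) == case["diagnosis"]
    for exam in EXAMS:
        context = [case["hpi"], case[exam]]  # HPI plus a single exam
        results[f"hpi+{exam}"] = diagnose(context) == case["diagnosis"]
    return results
```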
Fig. 6. LLMs are sensitive to the order of information.
Changing the order in which diagnostic information from MIMIC-CDM-FI is presented to the LLMs changes their diagnostic accuracy, even though the information itself stays the same. This places an unnecessary burden on clinicians, who would need to make preliminary diagnoses to decide the order in which to feed information to the models for best performance.
Extended Data Fig. 1. Model performance on the MIMIC-CDM-FI dataset.
LLMs perform best on the MIMIC-CDM-FI dataset, where all information required for a diagnosis is provided, especially on pathologies with strong indications such as appendicitis (a dilated appendix described in the radiologist's report) and pancreatitis (elevated pancreatic enzymes listed in the laboratory test results).
Extended Data Fig. 2. LLMs fail to consistently ask for a physical examination.
The diagnostic guidelines for each disease require a physical examination to be performed as the first action. Llama 2 Chat was the only LLM that consistently requested physical examinations. The '(Late) Physical Examination' category counts physical examinations that were requested, but not as the first piece of information.
Extended Data Fig. 3. LLMs often do not order the necessary laboratory tests required to establish a diagnosis.
The tests, defined by current diagnostic guidelines, help differentiate abdominal pathologies, as their results can indicate which organ is pathologically afflicted and which is functioning normally. When checking for the presence of the necessary tests in the MIMIC-CDM dataset, we find that the MIMIC Doctors requested all necessary tests in every patient case.
Extended Data Fig. 4. LLMs are incapable of interpreting lab results.
To test the ability of LLMs to interpret laboratory data, we provided each laboratory test result and its reference range and asked the model to classify the result as below the reference range (low), within the range (normal) or above the range (high). We found that the LLMs could not consistently interpret results as normal, low or high, despite being provided with all of the required information. The models performed especially poorly on abnormal results, which are of particular importance for establishing a diagnosis.
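This probe is mechanical to reproduce: the ground-truth label follows directly from comparing the value with its reference range. A minimal sketch, with `ask_llm` as a hypothetical stand-in for a model call:

```python
def expected_label(value: float, low: float, high: float) -> str:
    """Ground truth: the label is fully determined by the reference range."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def probe(ask_llm, tests):
    """tests: list of (name, value, low, high) tuples.
    Returns the fraction of results the model labels correctly."""
    correct = 0
    for name, value, low, high in tests:
        prompt = (f"{name}: {value} (reference range {low}-{high}). "
                  "Is this result low, normal, or high?")
        if ask_llm(prompt).strip().lower() == expected_label(value, low, high):
            correct += 1
    return correct / len(tests)

# Demo with a trivial stand-in "model"; values are illustrative.
tests = [("Lipase", 412.0, 0.0, 60.0), ("Sodium", 139.0, 133.0, 145.0)]
print(probe(lambda p: "high" if "412" in p else "normal", tests))
```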
Extended Data Fig. 5. The first imaging modality requested by the LLMs and the doctors in the MIMIC dataset.
LLMs sometimes follow diagnostic guidelines concerning imaging but often diagnose without requesting any imaging at all. As we show that imaging is the most useful diagnostic tool for all LLMs for each pathology except pancreatitis, this could be partly responsible for their low diagnostic accuracy. The legend specifies the colors of the imaging modalities and the patterns of the models.
Extended Data Fig. 6. LLMs struggle to follow instructions.
During the autonomous clinical decision-making task, LLMs often introduce errors when providing the next action to take and hallucinate non-existent tools, up to once every two patients. Formatting errors also occur regularly when models provide a diagnosis. In the clinic, extensive manual supervision would be required to ensure proper performance.
Extended Data Fig. 7. Small changes in instruction phrasing change diagnostic accuracy.
Small changes in the instructions, such as changing 'final diagnosis' to 'main diagnosis' or 'primary diagnosis', often greatly affect the performance of the LLMs on the MIMIC-CDM-FI dataset. The quality of the responses would therefore vary depending on who is using the model.
Extended Data Fig. 8. Filtering for only abnormal results generally improves LLM performance.
Filtering to include only abnormal laboratory results, using the laboratory reference ranges provided in the MIMIC-IV database, generally improves model performance, especially for the cholecystitis pathology. This allows the model to focus on abnormal, pathological signals. While this approach is appropriate for models as they function today, healthy laboratory test results are an important source of information for clinicians and should not degrade model performance.
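A minimal sketch of such a filter, assuming each laboratory result carries its MIMIC-IV reference range (the field names here are illustrative, not the actual MIMIC-IV schema):

```python
# Illustrative lab results; MIMIC-IV stores reference ranges per lab item.
labs = [
    {"name": "Lipase", "value": 412.0, "ref_low": 0.0, "ref_high": 60.0},
    {"name": "Sodium", "value": 139.0, "ref_low": 133.0, "ref_high": 145.0},
]

# Keep only values outside their reference range before building the
# model's context, so the prompt carries only pathological signals.
abnormal = [lab for lab in labs
            if not (lab["ref_low"] <= lab["value"] <= lab["ref_high"])]

for lab in abnormal:
    print(f'{lab["name"]}: {lab["value"]} '
          f'(ref {lab["ref_low"]}-{lab["ref_high"]})')
```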
