Diagnostic performance of newly developed large language models in critical illness cases: A comparative study
- PMID: 40865411
- DOI: 10.1016/j.ijmedinf.2025.106088
Abstract
Background: Large language models (LLMs) are increasingly used in clinical decision support, and newly developed models have shown promise, yet their diagnostic performance for critically ill patients in intensive care unit (ICU) settings remains underexplored. This study evaluated the diagnostic accuracy, differential diagnosis quality, and response quality of four newly developed LLMs in critical illness cases.
Methods: In this cross-sectional comparative study, four newly developed LLMs (ChatGPT-4o, ChatGPT-o3, DeepSeek-V3, and DeepSeek-R1) were evaluated using 50 critical illness cases in ICU settings from published literature. Diagnostic accuracy and response quality were compared across models.
Results: A total of 50 critical illness cases were included. ChatGPT-o3 achieved the highest top-diagnosis accuracy at 72% (36/50; 95% CI 0.600-0.840), followed by DeepSeek-R1 at 68% (34/50; 95% CI 0.540-0.800), ChatGPT-4o at 64% (32/50; 95% CI 0.500-0.760), and DeepSeek-V3 at 32% (16/50; 95% CI 0.200-0.460). ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o all significantly outperformed DeepSeek-V3, with no significant differences among the three. The median differential quality score was 5.0 for ChatGPT-o3 (IQR 5.0-5.0; 95% CI 5.0-5.0), DeepSeek-R1 (IQR 5.0-5.0; 95% CI 5.0-5.0), and ChatGPT-4o (IQR 4.0-5.0; 95% CI 4.5-5.0), and 4.0 for DeepSeek-V3 (IQR 3.0-5.0; 95% CI 4.0-5.0). ChatGPT-o3 and DeepSeek-R1 scored significantly higher than DeepSeek-V3; ChatGPT-4o showed a non-significant trend toward better performance than DeepSeek-V3. All models received high Likert ratings for response completeness, clarity, and usefulness. ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o each showed a trend toward better response quality compared to DeepSeek-V3, although no significant differences were observed among the models.
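As a rough illustration of the arithmetic behind the accuracy figures, the sketch below recomputes each model's proportion of correct top diagnoses from the counts reported above and attaches a 95% Wilson score interval. The abstract does not state how its confidence intervals were derived, so the Wilson method here is an assumption swapped in for illustration, and its bounds will not match the reported ones exactly. The names `wilson_interval`, `correct_counts`, and `results` are hypothetical.

```python
from statistics import NormalDist

N_CASES = 50  # number of critical illness cases reported in the abstract

# Correct top diagnoses per model, as reported in the abstract.
correct_counts = {
    "ChatGPT-o3": 36,
    "DeepSeek-R1": 34,
    "ChatGPT-4o": 32,
    "DeepSeek-V3": 16,
}


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its CI method; Wilson is used
    here only as a standard choice for small samples.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * ((p * (1 - p) / n + z**2 / (4 * n**2)) ** 0.5) / denom
    return center - half_width, center + half_width


# Map each model to (accuracy, lower bound, upper bound).
results = {
    model: (k / N_CASES, *wilson_interval(k, N_CASES))
    for model, k in correct_counts.items()
}

for model, (acc, lo, hi) in results.items():
    print(f"{model}: accuracy={acc:.2f}, 95% Wilson CI=({lo:.3f}, {hi:.3f})")
```

The point estimates (0.72, 0.68, 0.64, 0.32) follow directly from the counts; only the interval bounds depend on the assumed method.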
Conclusions: The newly developed models, especially the reasoning models, demonstrated strong potential in supporting diagnosis in critical illness cases in ICU settings. Domain-specific fine-tuning could further improve their diagnostic accuracy. Notably, the open-source reasoning model DeepSeek-R1 performed competitively, suggesting strong potential for scalable deployment in resource-limited settings.
Keywords: Artificial intelligence; DeepSeek; Differential diagnosis; Intensive care unit; Large language model.
Copyright © 2025 The Author(s). Published by Elsevier B.V. All rights reserved.
Conflict of interest statement
Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.