External validation of AI-based scoring systems in the ICU: a systematic review and meta-analysis
- PMID: 39762808
- PMCID: PMC11702098
- DOI: 10.1186/s12911-024-02830-7
Abstract
Background: Machine learning (ML) is increasingly used to predict clinical deterioration in intensive care unit (ICU) patients through scoring systems. Although promising, such algorithms often overfit their training cohort and perform worse at new hospitals. External validation is therefore a critical, but frequently overlooked, step in establishing whether predicted risk scores are reliable enough to translate into clinical practice. We systematically reviewed how regularly external validation of ML-based risk scores is performed and how their performance changed in external data.
Methods: We searched MEDLINE, Web of Science, and arXiv for studies using ML to predict deterioration of ICU patients from routine data. We included primary research published in English before December 2023. We summarised how many studies were externally validated, assessing differences over time, by outcome, and by data source. For validated studies, we evaluated the change in area under the receiver operating characteristic curve (AUROC) attributable to external validation using linear mixed-effects models.
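As an illustration of the quantity being compared, the sketch below computes AUROC for one model on an internal and an external cohort and takes the difference. It is a minimal example under stated assumptions: the `auroc` helper, the labels, and the scores are all hypothetical and not taken from the review, and the review's linear mixed-effects modelling is not reproduced here.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive case scores above a random negative one.

    Ties count as one half (the Mann-Whitney U formulation of AUROC).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores from one model, evaluated on two cohorts.
internal_labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
external_labels = [1, 1, 1, 0, 0, 0]
external_scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]

auroc_internal = auroc(internal_labels, internal_scores)
auroc_external = auroc(external_labels, external_scores)

# Negative values mean the model discriminates worse in the external cohort.
delta = auroc_external - auroc_internal
```

The pairwise loop is quadratic in cohort size, which is acceptable for a toy example; a rank-based implementation would be preferable for real data.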
Results: We included 572 studies, of which 84 (14.7%) were externally validated, increasing to 23.9% by 2023. Validated studies made disproportionate use of open-source data, with two well-known US datasets (MIMIC and eICU) accounting for 83.3% of studies. On average, AUROC changed by -0.037 (95% CI -0.052 to -0.027) in external data, with a reduction of more than 0.05 in 49.5% of studies.
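The arithmetic behind these summaries can be sketched as follows. The study counts come from the text above; the AUROC pairs are hypothetical and chosen only to show the calculation. The plain mean used here is a simplification and does not account for clustering the way the review's linear mixed-effects models do.

```python
from statistics import mean

# Proportion of included studies with external validation (counts from the text).
n_included = 572
n_validated = 84
pct_validated = round(100 * n_validated / n_included, 1)

# Hypothetical (internal AUROC, external AUROC) pairs; not data from the review.
pairs = [(0.85, 0.78), (0.90, 0.88), (0.80, 0.81), (0.88, 0.80)]

# Change attributable to external validation; negative means worse externally.
changes = [external - internal for internal, external in pairs]
mean_change = mean(changes)

# Share of pairs in which AUROC dropped by more than 0.05.
share_large_drop = sum(change < -0.05 for change in changes) / len(changes)
```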
Discussion: External validation, although increasing, remains uncommon. Performance was generally lower in external data, calling into question the reliability of some recently proposed ML-based scores. Interpretation of the results was complicated by an overreliance on the same few datasets, implicit differences in case mix, and exclusive use of AUROC.
Keywords: Acute deterioration; Electronic health records; External validation; Intensive care unit; Machine learning.
© 2024. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.