Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Meta-Analysis
. 2025 Jan 6;25(1):5.
doi: 10.1186/s12911-024-02830-7.

External validation of AI-based scoring systems in the ICU: a systematic review and meta-analysis

Affiliations
Meta-Analysis

External validation of AI-based scoring systems in the ICU: a systematic review and meta-analysis

Patrick Rockenschaub et al. BMC Med Inform Decis Mak. .

Abstract

Background: Machine learning (ML) is increasingly used to predict clinical deterioration in intensive care unit (ICU) patients through scoring systems. Although promising, such algorithms often overfit their training cohort and perform worse at new hospitals. Thus, external validation is a critical - but frequently overlooked - step to establish the reliability of predicted risk scores to translate them into clinical practice. We systematically reviewed how regularly external validation of ML-based risk scores is performed and how their performance changed in external data.

Methods: We searched MEDLINE, Web of Science, and arXiv for studies using ML to predict deterioration of ICU patients from routine data. We included primary research published in English before December 2023. We summarised how many studies were externally validated, assessing differences over time, by outcome, and by data source. For validated studies, we evaluated the change in area under the receiver operating characteristic (AUROC) attributable to external validation using linear mixed-effects models.

Results: We included 572 studies, of which 84 (14.7%) were externally validated, increasing to 23.9% by 2023. Validated studies made disproportionate use of open-source data, with two well-known US datasets (MIMIC and eICU) accounting for 83.3% of studies. On average, AUROC was reduced by -0.037 (95% CI -0.052 to -0.027) in external data, with more than 0.05 reduction in 49.5% of studies.

Discussion: External validation, although increasing, remains uncommon. Performance was generally lower in external data, questioning the reliability of some recently proposed ML-based scores. Interpretation of the results was challenged by an overreliance on the same few datasets, implicit differences in case mix, and exclusive use of AUROC.

Keywords: Acute deterioration; Electronic health records; External validation; Intensive care unit; Machine learning.

PubMed Disclaimer

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1
PRISMA flow diagram
Fig. 2
Fig. 2
Number of eligible (black) and included externally validated studies (orange) by year of publication
Fig. 3
Fig. 3
A Number of studies that used MIMIC, eICU, and/or one or more other datasets; B Number of studies in which MIMIC, eICU, and other datasets were used for model development, external validation, or both
Fig. 4
Fig. 4
Reported AUROCs for internal and external validation among (N = 84 − 2) included studies. Two studies were omitted because they did not report AUROC

Similar articles

Cited by

References

    1. Ferreira FL, Bota DP, Bross A, Mélot C, Vincent JL. Serial evaluation of the SOFA score to predict outcome in critically ill patients. JAMA. 2001;286:1754–8. - PubMed
    1. Vincent J-L, Moreno R. Clinical review: scoring systems in the critically ill. Crit Care. 2010;14:207. - PMC - PubMed
    1. Gerry S, Bonnici T, Birks J, Kirtley S, Virdee PS, Watkinson PJ, et al. Early warning scores for detecting deterioration in adult hospital patients: systematic review and critical appraisal of methodology. BMJ. 2020;369:m1501. - PMC - PubMed
    1. Vincent JL, Moreno R, Takala J, Willatts S, De Mendonça A, Bruining H, et al. The SOFA (Sepsis-related Organ failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-related problems of the European Society of Intensive Care Medicine. Intensive Care Med. 1996;22:707–10. - PubMed
    1. Vincent J-L, de Mendonca A, Cantraine F, Moreno R, Takala J, Suter PM, et al. Use of the SOFA score to assess the incidence of organ dysfunction/failure in intensive care units: results of a multicenter, prospective study. Crit Care Med. 1998;26:1793. - PubMed

LinkOut - more resources