The Impact of Multi-Institution Datasets on the Generalizability of Machine Learning Prediction Models in the ICU

Patrick Rockenschaub et al.

Observational Study. Crit Care Med. 2024 Nov 1;52(11):1710-1721. doi: 10.1097/CCM.0000000000006359. Epub 2024 Jul 3.

Abstract

Objectives: To evaluate the transferability of deep learning (DL) models for the early detection of adverse events to previously unseen hospitals.

Design: Retrospective observational cohort study utilizing harmonized intensive care data from four public datasets.

Setting: ICUs across Europe and the United States.

Patients: Adult patients admitted to the ICU for at least 6 hours whose records met data quality criteria.

Interventions: None.

Measurements and main results: Using carefully harmonized data from a total of 334,812 ICU stays, we systematically assessed the transferability of DL models for three common adverse events: death, acute kidney injury (AKI), and sepsis. We tested whether using more than one data source and/or algorithmically optimizing for generalizability during training improves model performance at new hospitals. We found that models achieved high area under the receiver operating characteristic curve (AUROC) for mortality (0.838-0.869), AKI (0.823-0.866), and sepsis (0.749-0.824) at the training hospital. As expected, AUROC dropped when models were applied at other hospitals, sometimes by as much as 0.200. Using more than one dataset for training mitigated the performance drop, with multicenter models performing roughly on par with the best single-center model. Dedicated methods promoting generalizability did not noticeably improve performance in our experiments.
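The multicenter training setup described here can be sketched as a leave-one-dataset-out loop: pool training data from all but one hospital dataset, then evaluate AUROC on the held-out one. The sketch below is purely illustrative, using synthetic cohorts with invented distribution shifts and a logistic-regression stand-in; the study itself trained gated recurrent unit models on harmonized ICU data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_synthetic_cohort(n, shift):
    """Toy features/labels standing in for one hospital's ICU stays."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    logits = X @ np.array([0.8, -0.5, 0.3, 0.0, 0.2]) - shift
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y

# Four cohorts with mild distribution shift, named after the study's datasets.
cohorts = {name: make_synthetic_cohort(2000, s)
           for name, s in [("AUMCdb", 0.0), ("eICU", 0.3),
                           ("HiRID", -0.2), ("MIMIC", 0.1)]}

results = {}
for held_out in cohorts:
    # Pool training data from every cohort except the held-out one.
    X_train = np.vstack([X for n, (X, _) in cohorts.items() if n != held_out])
    y_train = np.concatenate([y for n, (_, y) in cohorts.items() if n != held_out])
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    X_test, y_test = cohorts[held_out]
    results[held_out] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, auroc in results.items():
    print(f"{name}: external AUROC = {auroc:.3f}")
```

On real data, the scores collected in `results` would correspond to the "pooled (n-1)" external-validation row described in the figures.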

Conclusions: Our results emphasize the importance of diverse training data for DL-based risk prediction. They suggest that as data from more hospitals become available for training, models may become increasingly generalizable. Even so, good performance at a new hospital still depended on the inclusion of compatible hospitals during training.


Conflict of interest statement

The authors have disclosed that they do not have any potential conflicts of interest.

Figures

Figure 1.
Schematic overview of the experimental setup. CORAL = correlation alignment.
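CORAL, referenced in the caption, is one of the generalizability methods the study evaluated; its core idea is to penalize differences in second-order feature statistics between domains. A minimal NumPy sketch of the loss with toy data (shapes and inputs here are illustrative, not the study's features):

```python
import numpy as np

def coral_loss(source, target):
    """CORAL loss between two feature matrices of shape (n_samples, d)."""
    d = source.shape[1]
    cs = np.cov(source, rowvar=False)
    ct = np.cov(target, rowvar=False)
    # Squared Frobenius distance between domain covariances, normalized by 4*d^2.
    return np.sum((cs - ct) ** 2) / (4 * d ** 2)

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 8))
b = rng.normal(scale=1.5, size=(200, 8))
print(coral_loss(a, a))  # identical domains -> 0.0
print(coral_loss(a, b))  # mismatched covariance -> positive
```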
Figure 2.
Performance of the ICU mortality prediction model when trained on one dataset (rows) and evaluated on others (columns). A, Area under the receiver operating characteristic (AUROC). B, Area under the precision-recall curve (AUPRC). C, Sensitivity. D, Positive predictive value (PPV). The diagonal represents internal validation, that is, training and test samples were taken from the same dataset. In pooled (n–1), the model was trained on combined data from all except the test dataset. In all, training data from all datasets (including the test dataset) was used during model development. Sensitivity and PPV were evaluated on the stay level and thresholds were chosen based on the validation portion of the training dataset. Performance was averaged across five random resamples. All models used a gated recurrent unit featurizer. AUMCdb = Amsterdam University Medical Centers database, eICU = eICU Collaborative Research Database, HiRID = High Time Resolution ICU Dataset, MIMIC = Medical Information Mart for Intensive Care IV.
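The thresholding step described in the caption, choosing a decision threshold on the validation portion of the training data and then reporting sensitivity and positive predictive value on the test set, can be illustrated as follows. The Youden's J selection criterion and the toy scores are assumptions for illustration; the caption does not specify how thresholds were chosen.

```python
import numpy as np

def sensitivity_ppv(y_true, y_score, threshold):
    """Sensitivity and PPV of the binary rule y_score >= threshold."""
    pred = y_score >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sens, ppv

rng = np.random.default_rng(1)
# Toy validation/test risk scores standing in for model predictions.
y_val = rng.integers(0, 2, 500)
s_val = np.clip(y_val * 0.3 + rng.normal(0.4, 0.2, 500), 0, 1)
y_test = rng.integers(0, 2, 500)
s_test = np.clip(y_test * 0.3 + rng.normal(0.4, 0.2, 500), 0, 1)

# Pick the threshold on validation data by maximizing Youden's J
# (sensitivity + specificity - 1); an assumed criterion, not the study's.
def youden_j(t):
    sens, _ = sensitivity_ppv(y_val, s_val, t)
    spec = np.mean(s_val[y_val == 0] < t)
    return sens + spec - 1

best_t = max(np.linspace(0, 1, 101), key=youden_j)
sens, ppv = sensitivity_ppv(y_test, s_test, best_t)
print(f"threshold={best_t:.2f} sensitivity={sens:.3f} PPV={ppv:.3f}")
```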
Figure 3.
Performance of the acute kidney injury prediction model. A, Area under the receiver operating characteristic (AUROC). B, Area under the precision-recall curve (AUPRC). C, Sensitivity. D, Positive predictive value (PPV). All models used a gated recurrent unit featurizer. AUMCdb = Amsterdam University Medical Centers database, eICU = eICU Collaborative Research Database, HiRID = High Time Resolution ICU Dataset, MIMIC = Medical Information Mart for Intensive Care IV.
Figure 4.
Performance of the sepsis prediction model. A, Area under the receiver operating characteristic (AUROC). B, Area under the precision-recall curve (AUPRC). C, Sensitivity. D, Positive predictive value (PPV). All models used a gated recurrent unit featurizer. AUMCdb = Amsterdam University Medical Centers database, eICU = eICU Collaborative Research Database, HiRID = High Time Resolution ICU Dataset, MIMIC = Medical Information Mart for Intensive Care IV.
