Generalizability challenges of mortality risk prediction models: A retrospective analysis on a multi-center database

Harvineet Singh et al. PLOS Digit Health. 2022 Apr 5;1(4):e0000023. doi: 10.1371/journal.pdig.0000023. eCollection 2022 Apr.

Abstract

Modern predictive models require large amounts of data for training and evaluation; in their absence, models may become specific to particular locations, the populations within them, and local clinical practices. Yet best practices for clinical risk prediction models have not considered such challenges to generalizability. Here we ask whether population- and group-level performance of mortality prediction models varies significantly when the models are applied to hospitals or geographies different from the ones in which they were developed, and, further, which characteristics of the datasets explain the performance variation. In this multi-center cross-sectional study, we analyzed electronic health records from 179 hospitals across the US, covering 70,126 hospitalizations from 2014 to 2015. The generalization gap, defined as the difference in a model performance metric across hospitals, was computed for the area under the receiver operating characteristic curve (AUC) and the calibration slope. To assess model performance across racial groups, we report differences in false negative rates between groups. Data were also analyzed using the causal discovery algorithm Fast Causal Inference (FCI), which infers paths of causal influence while identifying potential influences associated with unmeasured variables. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (interquartile range, IQR; median 0.801); calibration slope from 0.725 to 0.983 (IQR; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (IQR; median 0.092). Distributions of all variable types (demographics, vitals, and labs) differed significantly across hospitals and regions. The race variable also mediated differences in the relationship between clinical variables and mortality across hospitals and regions. In conclusion, group-level performance should be assessed during generalizability checks to identify potential harms to specific groups. Moreover, to develop methods that improve model performance in new environments, a better understanding and documentation of the provenance of data and health processes are needed to identify and mitigate sources of variation.
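The three headline metrics can be made concrete in a few lines of code. Below is a minimal Python sketch, not the authors' code: the helper names, the logistic-recalibration form of the calibration slope, and the max-minus-min definition of FNR disparity are illustrative assumptions.

```python
# Sketch of the abstract's metrics: AUC generalization gap, calibration
# slope, and disparity in false negative rates (FNR) across groups.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def calibration_slope(y_true, y_prob, eps=1e-7):
    """Slope from refitting outcomes on the logit of predicted risk;
    a slope of 1.0 indicates well-calibrated risk estimates."""
    logits = np.log(y_prob + eps) - np.log(1.0 - y_prob + eps)
    # penalty=None requires scikit-learn >= 1.2
    lr = LogisticRegression(penalty=None).fit(logits.reshape(-1, 1), y_true)
    return lr.coef_[0, 0]

def fnr(y_true, y_pred):
    """False negative rate: y_pred is thresholded binary predictions."""
    pos = y_true == 1
    return np.mean(y_pred[pos] == 0) if pos.any() else np.nan

def disparity_fnr(y_true, y_pred, group):
    """Largest FNR gap between any two groups (e.g., race categories);
    one common definition, assumed here for illustration."""
    rates = [fnr(y_true[group == g], y_pred[group == g])
             for g in np.unique(group)]
    return np.nanmax(rates) - np.nanmin(rates)

def auc_gap(y_dev, p_dev, y_test, p_test):
    """Generalization gap: development-site AUC minus test-site AUC."""
    return roc_auc_score(y_dev, p_dev) - roc_auc_score(y_test, p_test)
```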


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Generalization of performance metrics across individual hospitals.
Results of transferring models across the top 10 hospitals by number of stays. Models are trained and tested on a fixed number of samples (1,631, the smallest count among the 10 hospitals) from each hospital. Results are averaged over 100 random subsamples for each of the 10×10 train-test hospital pairs. All 6 metrics show large variability when models are transferred across hospitals. Abbreviations: AUC, area under the ROC curve; CS, calibration slope; FNR, false negative rate.
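For readers reproducing this design, the caption's procedure amounts to a train-at-one-site, test-at-another loop over equal-size subsamples. The sketch below assumes a dict of per-hospital pandas DataFrames and uses logistic regression as a stand-in model; the paper's pipeline and model class may differ.

```python
# Minimal sketch of the Fig 1 transfer experiment; `data` is assumed to
# map hospital IDs to DataFrames with feature columns and a binary
# mortality label. Not the authors' code.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

N_SAMPLES = 1631   # equal-size subsamples, per the caption
N_REPEATS = 100    # random subsamples averaged per train-test pair
rng = np.random.default_rng(0)

def transfer_auc(df_train, df_test, features, label="mortality"):
    """Average test-hospital AUC of a model trained at another hospital."""
    aucs = []
    for _ in range(N_REPEATS):
        tr = df_train.sample(N_SAMPLES, random_state=int(rng.integers(2**31)))
        te = df_test.sample(N_SAMPLES, random_state=int(rng.integers(2**31)))
        model = LogisticRegression(max_iter=1000).fit(tr[features], tr[label])
        p = model.predict_proba(te[features])[:, 1]
        aucs.append(roc_auc_score(te[label], p))
    return float(np.mean(aucs))

# The 10x10 grid in Fig 1 is then
#   auc[i][j] = transfer_auc(data[h_i], data[h_j], FEATURES)
# over the top-10 hospitals by number of stays.
```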
Fig 2
Fig 2. Generalization of performance metrics across US geographic regions.
Results of transferring models after pooling hospitals into 4 regions (Northeast, South, Midwest, West). Models are trained and tested on 5,000 samples from each region. Results are averaged over 100 random subsamples for each of the 4×4 train-test region pairs. DisparityFNR and DisparityCS show large variability when models are transferred across regions. Abbreviations: AUC, area under the ROC curve; CS, calibration slope; FNR, false negative rate.
Fig 3
Fig 3. Statistical tests for dataset shifts.
Results of two-sample tests with and without pooling of hospitals by region. Test outcomes are plotted in (a,c) and test statistics in (b,d) to examine the results in more detail. Since swapping the order of the two samples in a two-sample test does not change the test statistic, only the lower halves of the matrices are plotted. Results are averaged over 100 random subsamples. Feature distributions change significantly across all hospital and region pairs.
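This excerpt does not name the specific two-sample test used in Fig 3, so the sketch below substitutes a per-feature Kolmogorov-Smirnov test as a stand-in; like the test in the caption, its statistic is symmetric in the two samples, which is why only the lower triangle of each matrix needs plotting.

```python
# Per-feature two-sample shift test between two sites, with the KS test
# as a stand-in for the paper's test statistic. Suitable for continuous
# features; categorical variables (e.g., race) would need a different
# test such as chi-squared.
from scipy.stats import ks_2samp

def feature_shift_tests(df_a, df_b, features, alpha=0.05):
    """KS statistic and rejection flag per feature. No multiple-testing
    correction shown; the paper averages results over 100 subsamples."""
    out = {}
    for f in features:
        stat, pval = ks_2samp(df_a[f].dropna(), df_b[f].dropna())
        out[f] = {"statistic": stat, "reject_null": pval < alpha}
    return out
```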
Fig 4
Fig 4. Shifts in variable distributions due to hospital, region, and other factors based on mortality causal graph.
Each row highlights (in red) the features that explain the shifts across the indicator labeling that row, i.e., the features with an edge from that indicator in the causal graph. For instance, the shift across the hospital-ID indicator (first row) is explained by shifts in the distributions of age, race, temperature (temp), urine output, and so on. We observe that the shifts are explained by changes in a few variables that are common across indicators. Full forms of the abbreviated feature names are given in Table C in S1 Text.
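The Fig 4 analysis boils down to appending a site indicator to the feature matrix, running FCI, and reading off which features are adjacent to the indicator. A runnable sketch with the open-source causal-learn package follows; the synthetic inputs, the numeric encoding of the indicator, and the API usage are assumptions about a reasonable implementation, not the authors' pipeline.

```python
# Sketch of the Fig 4 step: run FCI on features plus a hospital/region
# indicator column, then list features with an edge from the indicator.
# Synthetic data stands in for the EHR variables.
import numpy as np
from causallearn.search.ConstraintBased.FCI import fci

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))             # placeholder clinical features
indicator = rng.integers(0, 2, size=500)  # placeholder hospital/region ID
data = np.column_stack([X, indicator.astype(float)])

g, edges = fci(data, alpha=0.05)  # returns a PAG and its edge list

ind = data.shape[1] - 1           # index of the indicator column
nodes = g.get_nodes()
shifted = [i for i in range(ind)
           if g.get_edge(nodes[ind], nodes[i]) is not None]
print("Features with an edge from the site indicator:", shifted)
```

Because FCI allows for unmeasured confounders, edges from the indicator flag features whose distribution shifts with the site even when hidden variables are in play, which is what the red cells in Fig 4 summarize.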
