JMIRx Med. 2024 Jun 12;5:e45973.

doi: 10.2196/45973.

Performance Drift in Machine Learning Models for Cardiac Surgery Risk Prediction: Retrospective Analysis

Affiliations

¹ Bristol Heart Institute, Translational Health Sciences, University of Bristol, Bristol, United Kingdom.
² School of Computing Science, Northumbria University, Newcastle upon Tyne, United Kingdom.
³ Department of Cardiac Surgery, Rabindranath Tagore International Institute of Cardiac Sciences, West Bengal, India.

PMID: 38889069
PMCID: PMC11217160
DOI: 10.2196/45973

Tim Dong et al. JMIRx Med. 2024.

Abstract

Background: The Society of Thoracic Surgeons and European System for Cardiac Operative Risk Evaluation (EuroSCORE) II risk scores are the most commonly used risk prediction models for in-hospital mortality after adult cardiac surgery. However, they are prone to miscalibration over time and poor generalization across data sets; thus, their use remains controversial. Despite increased interest, a gap in understanding the effect of data set drift on the performance of machine learning (ML) over time remains a barrier to its wider use in clinical practice. Data set drift occurs when an ML system underperforms because of a mismatch between the data it was developed from and the data on which it is deployed.
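
The mismatch described above can be made concrete with a small sketch. The example below is an illustration only, not the method used in the study: it compares the category proportions of one variable between a development sample and a deployment sample using the total variation distance. The category labels and counts are synthetic, and the function names are hypothetical.

```python
from collections import Counter


def category_proportions(values):
    """Return the share of each category in a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(dev_values, deploy_values):
    """Half the summed absolute difference in category shares (0 = identical, 1 = disjoint)."""
    p = category_proportions(dev_values)
    q = category_proportions(deploy_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Synthetic labels for illustration only; not data from the study.
dev = ["elective"] * 70 + ["urgent"] * 25 + ["emergency"] * 5
deploy = ["elective"] * 60 + ["urgent"] * 32 + ["emergency"] * 8

tvd = total_variation_distance(dev, deploy)
print(f"Total variation distance: {tvd:.3f}")
```

A value near 0 suggests the deployment data resemble the development data for that variable; larger values indicate a shift that may warrant a closer look at model performance.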

Objective: In this study, we analyzed the extent of performance drift using models built on a large UK cardiac surgery database. The objectives were to (1) rank and assess the extent of performance drift in cardiac surgery risk ML models over time and (2) investigate any potential influence of data set drift and variable importance drift on performance drift.

Methods: We conducted a retrospective analysis of prospectively, routinely gathered data on adult patients undergoing cardiac surgery in the United Kingdom between 2012 and 2019. We temporally split the data 70:30 into a training and validation set and a holdout set. Five novel ML mortality prediction models were developed and assessed, along with EuroSCORE II, for relationships between and within variable importance drift, performance drift, and actual data set drift. Performance was assessed using a consensus metric.
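
A temporal 70:30 split of the kind described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions: records are ordered by operation date and cut by record count, so the earliest 70% form the training and validation set and the latest 30% form the holdout set. The abstract does not specify the exact cut rule, and the field name `op_date` and the function name are hypothetical.

```python
from datetime import date, timedelta


def temporal_split(records, date_key, train_percent=70):
    """Sort records by date and return (earlier part, later holdout part)."""
    ordered = sorted(records, key=lambda r: r[date_key])
    # Integer arithmetic avoids floating-point surprises at the cut point.
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]


# Synthetic records in reverse date order, for illustration only.
start = date(2012, 1, 1)
records = [
    {"id": i, "op_date": start + timedelta(days=30 * i)}
    for i in reversed(range(10))
]

development, holdout = temporal_split(records, "op_date")
print(len(development), len(holdout))
```

Splitting by time rather than at random means that every holdout record is later than every training record, which is what allows drift over time to be observed at all.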

Results: A total of 227,087 adults underwent cardiac surgery during the study period, with a mortality rate of 2.76% (n=6258). There was strong evidence of a decrease in overall performance across all models (P<.0001). Extreme gradient boosting (clinical effectiveness metric [CEM] 0.728, 95% CI 0.728-0.729) and random forest (CEM 0.727, 95% CI 0.727-0.728) were the overall best-performing models, both temporally and nontemporally. EuroSCORE II performed the worst across all comparisons. Sharp changes in variable importance and data set drift from October to December 2017, from June to July 2018, and from December 2018 to February 2019 mirrored the effects of performance decrease across models.
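
One simple way to summarize a decrease in performance over time is to fit a straight line to a metric measured at successive time points, in the spirit of the linear regression fits mentioned in the figure captions. The sketch below is an illustration only and uses ordinary least squares on synthetic monthly values; it is not the authors' analysis code, and the numbers are not results from the study.

```python
def linear_trend(values):
    """Ordinary least squares fit of values against time index 0, 1, 2, ...

    Returns (slope, intercept). A negative slope indicates a downward drift.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic monthly metric values, for illustration only.
monthly_metric = [0.80, 0.79, 0.78, 0.77, 0.76]

slope, intercept = linear_trend(monthly_metric)
print(f"slope per month: {slope:.4f}, intercept: {intercept:.4f}")
```

A slope alone does not establish drift; in practice it would be reported with a P value or confidence interval, as the figure captions indicate for the study's own regressions.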

Conclusions: All models show a decrease in at least 3 of the 5 individual metrics. CEM and variable importance drift detection demonstrate the limitation of logistic regression methods used for cardiac surgery risk prediction and the effects of data set drift. Future work will be required to determine the interplay between ML models and whether ensemble models could improve on their respective performance advantages.

Keywords: United Kingdom; adult; artificial intelligence; cardiac; cardiac surgery; cardiology; data; data set drift; heart; machine learning; model; mortality; national data set; operative mortality; performance; performance drift; prediction; risk; risk prediction; surgery.

© Tim Dong, Shubhra Sinha, Ben Zhai, Daniel Fudulu, Jeremy Chan, Pradeep Narayan, Andy Judge, Massimo Caputo, Arnaldo Dimagli, Umberto Benedetto, Gianni D Angelini. Originally published in JMIRx Med (https://med.jmirx.org).

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Design overview of the study. Nontemporal performance and drift (temporal) analyses were performed. Drifts in discrimination, calibration, clinical utility, data set, and variable importance were assessed. Time point assessments were performed for the clinical effectiveness metric (CEM). Drifts in component metrics of CEM were evaluated. AUC: area under the curve; ECE: expected calibration error; EuroSCORE: European System for Cardiac Operative Risk Evaluation; F1: F₁-score; neuronetwork: neural network; SVM: support vector machine; Xgboost: extreme gradient boosting.

Figure 2. (A) Plot of CEM values by model and time. Geometric mean (95% CI) of 1000 bootstraps at each time point is shown. The horizontal line represents the CEM geometric mean of all models. (B) Box plot of difference in models’ CEM values across the first 3 months of 2017 and 2019. Kruskal-Wallis results for CEM across the time points are shown. (C) Paired-samples Wilcoxon test (Wilcoxon signed rank test) for the first 3 months of 2019 bootstrap CEM values. P values are adjusted using the Bonferroni method. ****P<.0001. CEM: clinical effectiveness metric; EuroSCORE: European System for Cardiac Operative Risk Evaluation; ns: not significant; neuronetwork: neural network; SVM: support vector machine; Xgboost: extreme gradient boosting.

Figure 3. Plots of CEM values by model and time: (A) XGBoost, (B) random forest, (C) logistic regression, and (D) EuroSCORE II. The geometric mean of 1000 bootstraps at each time point is shown. The red dotted line shows linear regression, and the blue line shows generalized additive model fit. Parameters and P values for the linear regressions are shown. (E) Discrimination (AUC) performance drift by time. Linear regression lines are plotted for each model, with slope, intercept, and P values displayed in the legend. (F) Calibration (adjusted ECE) performance drift by time. Linear regression lines are plotted for each model, with slope, intercept, and P values displayed in the legend. SVM and EuroSCORE II are removed to enable a clearer separation of models with similar performance. AUC: area under the curve; CEM: clinical effectiveness metric; ECE: expected calibration error; EuroSCORE: European System for Cardiac Operative Risk Evaluation; neuronetwork: neural network; SVM: support vector machine; Xgboost: extreme gradient boosting.

Figure 4. (A) Clinical effectiveness (net benefit) performance drift by time. Linear regression lines are plotted for each model, with slope, intercept, and P values displayed in the legend. SVM and EuroSCORE II are removed to enable a clearer separation of models with similar performance. (B) SHAP variable importance drift for the holdout set over 27 months (EuroSCORE II and XGBoost). Solid dots show geometric mean values of 5-fold cross-validation. Smoothed locally estimated scatterplot lines are plotted, with green bands showing 95% CIs. (C) SHAP variable importance drift for the holdout set over 27 months for the top 6 most important variables (EuroSCORE II and XGBoost). The trends are unsmoothed. (D) Operative urgency data set drift across time for the holdout set. The percentages of each category are shown for each time point. CCS: Canadian Cardiovascular Society; CPS: critical preoperative state; EuroSCORE: European System for Cardiac Operative Risk Evaluation; ES: EuroSCORE; LV: left ventricle; MI: myocardial infarction; neuronetwork: neural network; NYHA: New York Heart Association; PA: pulmonary artery; PVD: peripheral vascular disease; SHAP: Shapley additive explanations; SVM: support vector machine; Xgboost: extreme gradient boosting.

Figure 5. The actual and projected net benefit drift for the NN and XGBoost models over time. NN: neural network; XGBoost: extreme gradient boosting.

Update of

doi: 10.2196/preprints.45973
doi: 10.1101/2023.01.21.23284795

Cited by

Kenig N, Monton Echeverria J, Muntaner Vives A. Artificial Intelligence in Surgery: A Systematic Review of Use and Validation. J Clin Med. 2024 Nov 24;13(23):7108. doi: 10.3390/jcm13237108. PMID: 39685566. Review.
Sinha S, Dong T, Dimagli A, Judge A, Angelini GD. A machine learning algorithm-based risk prediction score for in-hospital/30-day mortality after adult cardiac surgery. Eur J Cardiothorac Surg. 2024 Oct 1;66(4):ezae368. doi: 10.1093/ejcts/ezae368. PMID: 39374541.
Fang LX, Wu YH, Yao T, Wang ZN, Qian S, Jiang T, Xu J, Lin YN, Li YC. Use of pulse pressure index for cardiovascular outcomes assessment and development of a coronary heart disease model for the elderly. BMC Cardiovasc Disord. 2025 Apr 18;25(1):297. doi: 10.1186/s12872-025-04641-8. PMID: 40251528.
Birlik AB, Tozan H, Köse KB. Machine learning-based hybrid risk estimation system (ERES) in cardiac surgery: Supplementary insights from the ASA score analysis. PLOS Digit Health. 2025 Jun 23;4(6):e0000889. doi: 10.1371/journal.pdig.0000889. eCollection 2025 Jun. PMID: 40549810.
Dong T, Oronti IB, Sinha S, Freitas A, Zhai B, Chan J, Fudulu DP, Caputo M, Angelini GD. Enhancing Cardiovascular Risk Prediction: Development of an Advanced Xgboost Model with Hospital-Level Random Effects. Bioengineering (Basel). 2024 Oct 18;11(10):1039. doi: 10.3390/bioengineering11101039. PMID: 39451414.

Grants and funding

CH/17/1/32804/BHF_/British Heart Foundation/United Kingdom
