J Am Med Inform Assoc. 2023 Dec 22;31(1):35-44. doi: 10.1093/jamia/ocad159.

Evaluation of crowdsourced mortality prediction models as a framework for assessing artificial intelligence in medicine

Timothy Bergquist et al.

Abstract

Objective: Applications of machine learning in healthcare are of high interest and have the potential to improve patient care. Yet, the real-world accuracy of these models in clinical practice and on different patient subpopulations remains unclear. To address these important questions, we hosted a community challenge to evaluate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as the community challenge question.

Materials and methods: Using a Model-to-Data framework, 345 registered participants, organized into 25 independent teams across 3 continents and 10 countries, generated 25 models, all trained on a dataset of over 1.1 million patients and evaluated on a patient cohort prospectively collected over a 1-year observation period in a large health system.

Results: The top-performing team achieved a final area under the receiver operating characteristic curve (AUROC) of 0.947 (95% CI, 0.942-0.951) and an area under the precision-recall curve (AUPRC) of 0.487 (95% CI, 0.458-0.499) on the prospectively collected patient cohort.
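Confidence intervals like these are typically obtained by resampling. The sketch below shows a minimal percentile-bootstrap CI for AUROC and AUPRC; the paper's scoring code is not reproduced here, so the toy labels, toy scores, and the bootstrap_ci helper are illustrative assumptions built on scikit-learn's roc_auc_score and average_precision_score.

```python
# Hypothetical sketch: percentile-bootstrap 95% CIs for AUROC/AUPRC.
# y_true/y_score are toy stand-ins for observed mortality and model risk scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)            # toy 0/1 mortality labels
y_score = 0.5 * y_true + rng.random(1000)    # toy risk scores correlated with labels

def bootstrap_ci(y_true, y_score, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a ranking metric."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample patients with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                                     # resample lacks both classes; metric undefined
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_score), (lo, hi)

auroc, auroc_ci = bootstrap_ci(y_true, y_score, roc_auc_score)
auprc, auprc_ci = bootstrap_ci(y_true, y_score, average_precision_score)
print(f"AUROC {auroc:.3f} (95% CI {auroc_ci[0]:.3f}-{auroc_ci[1]:.3f})")
print(f"AUPRC {auprc:.3f} (95% CI {auprc_ci[0]:.3f}-{auprc_ci[1]:.3f})")
```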

Discussion: Post hoc analysis after the challenge revealed that models differed in accuracy across subpopulations delineated by race or gender, even when trained on the same data.

Conclusion: This is the largest community challenge to date focused on evaluating state-of-the-art machine learning methods in a healthcare system, revealing both the opportunities and pitfalls of clinical AI.

Keywords: evaluation; health informatics; machine learning.

Conflict of interest statement

None declared.

Figures

Figure 1.
Model-to-Data architecture to evaluate the performance of EHR prediction models in the Patient Mortality DREAM Challenge. Models were developed in local environments using synthetic data that resembled the real, private EHR data. Docker images were submitted through the Synapse collaboration platform to a submission queue. Images were pulled into the AWS cloud environment provided by the National Center for Advancing Translational Sciences (NCATS) and run against a synthetic dataset for technical validation (Stage 1). Once validated, images were pulled into the UW Medicine secure infrastructure and run against the private EHR data. Model predictions were evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), which were returned to participants through Synapse.
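A minimal sketch of this evaluation loop, under stated assumptions: the image name, file layout, and run_and_score helper are hypothetical, not the challenge's actual infrastructure code. The idea is that a submitted Docker image runs with the data mounted read-only and networking disabled, so only its predictions leave the secure environment.

```python
# Hypothetical sketch of the Model-to-Data loop in Figure 1: run a submitted
# container against protected data, then score only the emitted predictions.
import subprocess
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

IMAGE = "docker.synapse.org/syn123/team-model:latest"  # hypothetical submission image

def run_and_score(image, data_dir, output_dir, gold_csv):
    # Run the container with EHR data mounted read-only and no network access,
    # so the model can read inputs and write predictions but nothing else.
    subprocess.run(
        ["docker", "run", "--rm", "--network", "none",
         "-v", f"{data_dir}:/data:ro",
         "-v", f"{output_dir}:/output",
         image],
        check=True,
    )
    preds = pd.read_csv(f"{output_dir}/predictions.csv")  # assumed columns: person_id, score
    gold = pd.read_csv(gold_csv)                          # assumed columns: person_id, status
    merged = gold.merge(preds, on="person_id")
    return (roc_auc_score(merged["status"], merged["score"]),
            average_precision_score(merged["status"], merged["score"]))

# Example usage (paths are placeholders):
# auroc, auprc = run_and_score(IMAGE, "/secure/ehr", "/secure/out", "/secure/gold.csv")
```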
Figure 2.
Comparison of model performance between the leaderboard phase and the validation phase (data in Table 2). All models decreased in AUROC and AUPRC between the 2 phases, with the top 5 teams' AUROCs decreasing the least; only the top 5 teams' performances are colored. The error bars for the AUROCs represent the 95% confidence intervals.
Figure 3.
Bootstrapped distributions (n = 10 000) of the top 10 model AUROCs broken down by race. Model predictions were randomly sampled with replacement and scored against the benchmark gold standard. Box-plot center lines represent the median AUROC, box limits represent the upper and lower quartiles, whiskers represent the 1.5× interquartile range, and the points represent outliers. Comparisons were made between each pair of racial categories, and Bayes factors were calculated to assess the level of evidence that a model is more accurate on one racial category than on another. The heat maps show the log of the calculated Bayes factors for each within-model comparison of racial groups. The darker the red, the stronger the evidence that accuracy on the racial category is higher than on the comparison category; the darker the blue, the stronger the evidence that it is lower. Bayes factor values range from 10 000 to 0.0001, and the color scale is normalized across all comparisons. Raw Bayes factor values can be found in Table S2.
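A minimal sketch of this subgroup analysis, assuming the Bayes factor is approximated from paired bootstrap samples as the ratio of resamples favoring each direction (a common approach in DREAM-style scoring); the subgroup_bootstrap_auroc and bayes_factor helpers are illustrative, not the challenge's actual scoring code.

```python
# Hypothetical sketch of the Figure 3 analysis: bootstrap AUROC within each
# racial group, then a Bayes factor K for "group A scores higher than group B".
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_bootstrap_auroc(y_true, y_score, groups, group, n_boot=10_000, seed=0):
    """Bootstrap AUROC distribution restricted to one subgroup."""
    rng = np.random.default_rng(seed)
    mask = np.asarray(groups) == group
    yt, ys = np.asarray(y_true)[mask], np.asarray(y_score)[mask]
    out = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(yt), len(yt))  # resample within the subgroup
        if yt[idx].min() == yt[idx].max():
            continue                             # resample lacks both classes; skip
        out.append(roc_auc_score(yt[idx], ys[idx]))
    return np.array(out)

def bayes_factor(boot_a, boot_b):
    """K = (# resamples where A beats B) / (# where B beats A).

    Clipped to [1e-4, 1e4], matching the 10 000-0.0001 range in the caption.
    """
    n = min(len(boot_a), len(boot_b))
    wins = np.sum(boot_a[:n] > boot_b[:n])
    return float(np.clip(wins / max(n - wins, 1), 1e-4, 1e4))

# Example usage (y_true, y_score, race are placeholders for the cohort data):
# boot_a = subgroup_bootstrap_auroc(y_true, y_score, race, "A")
# boot_b = subgroup_bootstrap_auroc(y_true, y_score, race, "B")
# print(np.log10(bayes_factor(boot_a, boot_b)))  # log Bayes factor, as in the heat maps
```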
