J Am Med Inform Assoc. 2023 Dec 22;31(1):35-44. doi: 10.1093/jamia/ocad159.

Evaluation of crowdsourced mortality prediction models as a framework for assessing artificial intelligence in medicine

Timothy Bergquist et al.

Abstract

Objective: Applications of machine learning in healthcare are of high interest and have the potential to improve patient care. Yet, the real-world accuracy of these models in clinical practice and on different patient subpopulations remains unclear. To address these important questions, we hosted a community challenge to evaluate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as the community challenge question.

Materials and methods: Using a Model-to-Data framework, 345 registered participants, organized into 25 independent teams across 3 continents and 10 countries, generated 25 models, all trained on a dataset of over 1.1 million patients and evaluated on a patient cohort prospectively collected over a 1-year observation period in a large health system.

Results: The top-performing team achieved a final area under the receiver operating characteristic curve (AUROC) of 0.947 (95% CI, 0.942-0.951) and an area under the precision-recall curve (AUPRC) of 0.487 (95% CI, 0.458-0.499) on the prospectively collected patient cohort.
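Confidence intervals like these are typically obtained by resampling. The sketch below shows a minimal percentile-bootstrap CI for AUROC and AUPRC; the paper's scoring code is not reproduced here, so the toy labels, toy scores, and the bootstrap_ci helper are illustrative assumptions built on scikit-learn's roc_auc_score and average_precision_score.

```python
# Hypothetical sketch: percentile-bootstrap 95% CIs for AUROC/AUPRC.
# y_true/y_score are toy stand-ins for observed mortality and model risk scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)            # toy 0/1 mortality labels
y_score = 0.5 * y_true + rng.random(1000)    # toy risk scores correlated with labels

def bootstrap_ci(y_true, y_score, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a ranking metric."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample patients with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                                     # resample lacks both classes; metric undefined
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_score), (lo, hi)

auroc, auroc_ci = bootstrap_ci(y_true, y_score, roc_auc_score)
auprc, auprc_ci = bootstrap_ci(y_true, y_score, average_precision_score)
print(f"AUROC {auroc:.3f} (95% CI {auroc_ci[0]:.3f}-{auroc_ci[1]:.3f})")
print(f"AUPRC {auprc:.3f} (95% CI {auprc_ci[0]:.3f}-{auprc_ci[1]:.3f})")
```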

Discussion: Post hoc analysis after the challenge revealed that models differed in accuracy across subpopulations delineated by race or gender, even when trained on the same data.

Conclusion: This is the largest community challenge to date focused on evaluating state-of-the-art machine learning methods in a healthcare system, revealing both the opportunities and pitfalls of clinical AI.

Keywords: evaluation; health informatics; machine learning.

Conflict of interest statement

None declared.

Figures

Figure 1.
Model-to-Data architecture to evaluate the performance of EHR prediction models in the Patient Mortality DREAM Challenge. Models were developed in local environments using synthetic data that resembled the real, private EHR data. Docker images were submitted through the Synapse collaboration platform to a submission queue. Images were pulled into the AWS cloud environment provided by the National Center for Advancing Translational Sciences (NCATS) and run against a synthetic dataset for technical validation (Stage 1). Once validated, images were pulled into the UW Medicine secure infrastructure and run against the private EHR data. Model predictions were evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), which were returned to participants through Synapse.
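A minimal sketch of this evaluation loop, under stated assumptions: the image name, file layout, and run_and_score helper are hypothetical, not the challenge's actual infrastructure code. The idea is that a submitted Docker image runs with the data mounted read-only and networking disabled, so only its predictions leave the secure environment.

```python
# Hypothetical sketch of the Model-to-Data loop in Figure 1: run a submitted
# container against protected data, then score only the emitted predictions.
import subprocess
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

IMAGE = "docker.synapse.org/syn123/team-model:latest"  # hypothetical submission image

def run_and_score(image, data_dir, output_dir, gold_csv):
    # Run the container with EHR data mounted read-only and no network access,
    # so the model can read inputs and write predictions but nothing else.
    subprocess.run(
        ["docker", "run", "--rm", "--network", "none",
         "-v", f"{data_dir}:/data:ro",
         "-v", f"{output_dir}:/output",
         image],
        check=True,
    )
    preds = pd.read_csv(f"{output_dir}/predictions.csv")  # assumed columns: person_id, score
    gold = pd.read_csv(gold_csv)                          # assumed columns: person_id, status
    merged = gold.merge(preds, on="person_id")
    return (roc_auc_score(merged["status"], merged["score"]),
            average_precision_score(merged["status"], merged["score"]))

# Example usage (paths are placeholders):
# auroc, auprc = run_and_score(IMAGE, "/secure/ehr", "/secure/out", "/secure/gold.csv")
```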
Figure 2.
Comparison of model performance between the leaderboard phase and the validation phase (data in Table 2). All models decreased in AUROC and AUPRC between the 2 phases, with the top 5 teams' AUROCs decreasing the least; only the top 5 teams' performances are colored. The error bars for the AUROCs represent the 95% confidence intervals.
Figure 3.
Bootstrapped distributions (n = 10 000) of the top 10 model AUROCs broken down by race. Model predictions were randomly sampled with replacement and scored against the benchmark gold standard. Box-plot center lines represent the median AUROC, box limits represent the upper and lower quartiles, whiskers represent the 1.5× interquartile range, and the points represent outliers. Comparisons were made between each pair of racial categories, and Bayes factors were calculated to assess the level of evidence that a model is more accurate on one racial category than on another. The heat maps show the log of the calculated Bayes factors for each within-model comparison of racial groups. The darker the red, the stronger the evidence that accuracy on the racial category is higher than on the comparison category; the darker the blue, the stronger the evidence that it is lower. Bayes factor values range from 10 000 to 0.0001, and the color scale is normalized across all comparisons. Raw Bayes factor values can be found in Table S2.
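A minimal sketch of this subgroup analysis, assuming the Bayes factor is approximated from paired bootstrap samples as the ratio of resamples favoring each direction (a common approach in DREAM-style scoring); the subgroup_bootstrap_auroc and bayes_factor helpers are illustrative, not the challenge's actual scoring code.

```python
# Hypothetical sketch of the Figure 3 analysis: bootstrap AUROC within each
# racial group, then a Bayes factor K for "group A scores higher than group B".
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_bootstrap_auroc(y_true, y_score, groups, group, n_boot=10_000, seed=0):
    """Bootstrap AUROC distribution restricted to one subgroup."""
    rng = np.random.default_rng(seed)
    mask = np.asarray(groups) == group
    yt, ys = np.asarray(y_true)[mask], np.asarray(y_score)[mask]
    out = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(yt), len(yt))  # resample within the subgroup
        if yt[idx].min() == yt[idx].max():
            continue                             # resample lacks both classes; skip
        out.append(roc_auc_score(yt[idx], ys[idx]))
    return np.array(out)

def bayes_factor(boot_a, boot_b):
    """K = (# resamples where A beats B) / (# where B beats A).

    Clipped to [1e-4, 1e4], matching the 10 000-0.0001 range in the caption.
    """
    n = min(len(boot_a), len(boot_b))
    wins = np.sum(boot_a[:n] > boot_b[:n])
    return float(np.clip(wins / max(n - wins, 1), 1e-4, 1e4))

# Example usage (y_true, y_score, race are placeholders for the cohort data):
# boot_a = subgroup_bootstrap_auroc(y_true, y_score, race, "A")
# boot_b = subgroup_bootstrap_auroc(y_true, y_score, race, "B")
# print(np.log10(bayes_factor(boot_a, boot_b)))  # log Bayes factor, as in the heat maps
```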
