The Development and Validation of Simplified Machine Learning Algorithms to Predict Prognosis of Hospitalized Patients With COVID-19: Multicenter, Retrospective Study
- PMID: 34951865
- PMCID: PMC8785956
- DOI: 10.2196/31549
Abstract
Background: The current COVID-19 pandemic is unprecedented. In resource-constrained settings, predictive algorithms can help stratify disease severity and alert physicians to high-risk patients; however, few risk scores have been derived from a substantially large electronic health record (EHR) data set using simplified predictors as input.
Objective: The objectives of this study were to develop and validate simplified machine learning algorithms that predict COVID-19 adverse outcomes; to evaluate the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and calibration of the algorithms; and to derive clinically meaningful thresholds.
Methods: We performed machine learning model development and validation via a cohort study using multicenter, patient-level, longitudinal EHRs from the Optum COVID-19 database, which provides anonymized, longitudinal EHRs from across the United States. The models were developed based on clinical characteristics to predict 28-day in-hospital mortality, intensive care unit (ICU) admission, respiratory failure, and mechanical ventilator use in the inpatient setting. Data from patients who were admitted from February 1, 2020, to September 7, 2020, were randomly sampled into development, validation, and test data sets; data collected from September 7, 2020, to November 15, 2020, were reserved as the postdevelopment prospective test data set.
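The data partitioning described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `split_cohort`, the 60/20/20 split fractions, and the decision to place admissions on the September 7 boundary date in the prospective set are all assumptions made for illustration, since the abstract does not state them.

```python
import random
from datetime import date

# Dates taken from the Methods; how the boundary day is assigned is an assumption.
STUDY_START = date(2020, 2, 1)
CUTOFF = date(2020, 9, 7)
STUDY_END = date(2020, 11, 15)


def split_cohort(admissions, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (patient_id, admission_date) pairs into four data sets.

    Admissions before CUTOFF are shuffled and divided into development,
    validation, and test sets; later admissions form the prospective test set.
    The default fractions are hypothetical, not taken from the study.
    """
    early = [pid for pid, d in admissions if STUDY_START <= d < CUTOFF]
    late = [pid for pid, d in admissions if CUTOFF <= d <= STUDY_END]

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(early)

    n_dev = round(len(early) * fractions[0])
    n_val = round(len(early) * fractions[1])
    return {
        "development": early[:n_dev],
        "validation": early[n_dev:n_dev + n_val],
        "test": early[n_dev + n_val:],  # remainder of the early period
        "prospective_test": late,
    }
```

Keeping the later admissions entirely out of the random split is what lets the prospective set act as a check on performance beyond the development period.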
Results: Of the 3.7 million patients in the analysis, 585,867 patients were diagnosed with or tested positive for SARS-CoV-2, and 50,703 adult patients were hospitalized with COVID-19 between February 1 and November 15, 2020. Among the study cohort (n=50,703), there were 6204 deaths, 9564 ICU admissions, 6478 patients who were mechanically ventilated or received ECMO, and 25,169 patients who developed acute respiratory distress syndrome or respiratory failure within 28 days of hospital admission. The algorithms demonstrated high accuracy on the test data set (AUC 0.89, 95% CI 0.89-0.89; n=10,752) and consistent prediction through the second wave of the pandemic, from September to November, on the postdevelopment prospective test data set (AUC 0.85, 95% CI 0.85-0.86; n=14,863), as well as great clinical relevance and utility. In addition, a comprehensive set of 386 input covariates from baseline or at admission was included in the analysis; the end-to-end pipeline automates feature selection and model development. The parsimonious model with only 10 input predictors produced comparably accurate predictions; these 10 predictors (age, blood urea nitrogen, SpO2, systolic and diastolic blood pressures, respiration rate, pulse, temperature, albumin, and major cognitive disorder excluding stroke) are commonly measured and concordant with recognized risk factors for COVID-19.
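For readers unfamiliar with how an AUC and its confidence interval are obtained, the following Python sketch shows one common approach: the AUC computed as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, with a percentile bootstrap interval. The abstract does not say how the reported 95% CIs were derived, so the bootstrap, the function names, and the default parameters here are assumptions.

```python
import random


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (an assumed method, not the study's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The pairwise comparison is quadratic in cohort size, which is acceptable for a sketch; with test sets of more than 10,000 patients a rank-based computation would be preferable. Narrow intervals such as 0.89-0.89 are what one would expect from samples of that size.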
Conclusions: The systematic approach and rigorous validation demonstrate consistent model performance, even beyond the period of data collection, with satisfactory discriminatory power and great clinical utility. Overall, the study offers an accurate, validated, and reliable prediction model based on only 10 clinical features as a prognostic tool to stratify patients with COVID-19 into intermediate-, high-, and very high-risk groups. This simple predictive tool is shared with the wider health care community to serve as an early warning system that alerts physicians to possible high-risk patients, or as a resource triaging tool to optimize health care resources.
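The stratification into three risk groups can be pictured as applying two cutoffs to a predicted probability. The Python sketch below is illustrative only: the cutoff values 0.2 and 0.5 are hypothetical placeholders, not the clinically derived thresholds from the study, and the helper names are assumptions. It also shows how sensitivity and specificity follow from a chosen cutoff.

```python
def risk_group(probability, high_cutoff=0.2, very_high_cutoff=0.5):
    """Map a predicted probability to a risk group using hypothetical cutoffs."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= very_high_cutoff:
        return "very high"
    if probability >= high_cutoff:
        return "high"
    return "intermediate"


def sensitivity_specificity(labels, probabilities, cutoff):
    """Sensitivity and specificity when predictions >= cutoff are flagged."""
    tp = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)
```

Lowering a cutoff flags more patients, which raises sensitivity at the cost of specificity; this is the trade-off that a clinically meaningful threshold has to balance.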
Keywords: COVID-19; machine learning; predictive algorithm; prognostic model.
©Fang He, John H Page, Kerry R Weinberg, Anirban Mishra. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 21.01.2022.
Conflict of interest statement
Conflicts of Interest: FH, JHP, and AM are employees and stockholders of Amgen, Inc. KRW, an employee of League Inc, was formerly an employee of Amgen, Inc and owns stock in Amgen, Inc.