2023 Oct 6:7:e46807.
doi: 10.2196/46807.

Identification and Prediction of Clinical Phenotypes in Hospitalized Patients With COVID-19: Machine Learning From Medical Records

Tom Velez et al. JMIR Form Res.

Abstract

Background: There is significant heterogeneity in disease progression among hospitalized patients with COVID-19. The pathogenesis of SARS-CoV-2 infection is attributed to a complex interplay between virus and host immune response that in some patients unpredictably and rapidly leads to "hyperinflammation" associated with increased risk of mortality. The early identification of patients at risk of progression to hyperinflammation may help inform timely therapeutic decisions and lead to improved outcomes.

Objective: The primary objective of this study was to use machine learning to reproducibly identify specific risk-stratifying clinical phenotypes across hospitalized patients with COVID-19 and compare treatment response characteristics and outcomes. A secondary objective was to derive a predictive phenotype classification model using routinely available early encounter data that may be useful in informing optimal COVID-19 bedside clinical management.

Methods: This was a retrospective analysis of electronic health record data of adult patients (N=4379) who were admitted to a Johns Hopkins Health System hospital for COVID-19 treatment from 2020 to 2021. Phenotypes were identified by clustering 38 routine clinical observations recorded during inpatient care. To examine the reproducibility and validity of the derived phenotypes, patient data were randomly divided into 2 cohorts, and clustering analysis was performed independently for each cohort. A predictive phenotype classifier using the gradient-boosting machine method was derived using routine clinical observations recorded during the first 6 hours following admission.
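The split-cohort reproducibility check described in the Methods can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: plain 2-means stands in for the weighted consensus clustering used in the study, the two features (age and neutrophil-to-lymphocyte ratio) and all values are synthetic, and the names `split_cohorts` and `two_means` are hypothetical.

```python
import random

def split_cohorts(patient_ids, seed=0):
    """Randomly divide patient IDs into two disjoint cohorts of near-equal size."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

def two_means(points, iters=50):
    """Plain 2-cluster k-means (an assumed stand-in for consensus clustering)."""
    # Deterministic start: the points with the smallest and largest first feature.
    centers = [list(min(points, key=lambda p: p[0])),
               list(max(points, key=lambda p: p[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Synthetic (age, NLR) pairs for 200 hypothetical patients in two latent groups.
rng = random.Random(1)
data = {}
for pid in range(200):
    if pid % 2 == 0:
        data[pid] = (rng.gauss(50, 5), rng.gauss(4, 1))
    else:
        data[pid] = (rng.gauss(75, 5), rng.gauss(12, 2))

# Cluster each cohort independently, then compare the resulting centers.
train_ids, valid_ids = split_cohorts(data)
_, train_centers = two_means([data[i] for i in train_ids])
_, valid_centers = two_means([data[i] for i in valid_ids])
print("training centers:  ", train_centers)
print("validation centers:", valid_centers)
```

If the two cohorts yield similar cluster centers, that is informal evidence that the grouping reflects structure in the data and not the particular random split.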

Results: A total of 2 phenotypes (designated as phenotype 1 and phenotype 2) were identified in patients admitted for COVID-19 in both the training and validation cohorts with similar distributions of features, correlations with biomarkers, treatments, comorbidities, and outcomes. In both the training and validation cohorts, phenotype-2 patients were older; had elevated markers of inflammation; and were at an increased risk of requiring intensive care unit-level care, developing sepsis, and mortality compared with phenotype-1 patients. The gradient-boosting machine phenotype prediction model yielded an area under the curve of 0.89 and a positive predictive value of 0.83.
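For readers less familiar with the two reported metrics, the sketch below shows how an area under the curve and a positive predictive value are computed from classifier scores. The scores and labels are made up for illustration, the 0.5 decision threshold is an assumption (the abstract does not state the operating point), and the function names are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ppv(scores, labels, threshold=0.5):
    """PPV = true positives / all predicted positives at the given threshold."""
    predicted_pos = [y for s, y in zip(scores, labels) if s >= threshold]
    if not predicted_pos:
        raise ValueError("no predicted positives at this threshold")
    return sum(predicted_pos) / len(predicted_pos)

# Hypothetical predicted probabilities of phenotype 2 and the true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

example_auc = auc(scores, labels)   # 8 of 9 positive-negative pairs ranked correctly
example_ppv = ppv(scores, labels)   # 2 of 3 predicted positives are true positives
print(round(example_auc, 3), round(example_ppv, 3))
```

Note that the AUC is threshold-free, whereas the PPV depends on both the chosen threshold and the prevalence of phenotype 2 in the cohort.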

Conclusions: Using machine learning clustering, we identified and internally validated 2 clinical COVID-19 phenotypes with distinct treatment or response characteristics consistent with similar 2-phenotype models derived from other hospitalized populations with COVID-19, supporting the reliability and generalizability of these findings. COVID-19 phenotypes can be accurately identified using machine learning models based on readily available early encounter clinical data. A phenotype prediction model based on early encounter data may be clinically useful for timely bedside risk stratification and treatment personalization.

Keywords: COVID; big data; biomarkers; clinical phenotypes; critical care; early warning; electronic medical record; immune response; infection; machine learning; mortality; pathogenesis; phenotype; respiratory distress; sepsis; support tool; training; treatment; utility.

Conflict of interest statement

Conflicts of Interest: TV is the chief executive officer of Computer Technology Associates, Inc, a small business engaged in the commercialization of an artificial intelligence platform called “VFusion” directed at the clinical decision support market. Computer Technology Associates self-funded their participation in this study. BG is a member of the Food and Drug Administration Pulmonary-Allergy Drugs Advisory Committee and a board member of the Society of Bedside Medicine and has received consulting fees from Janssen Research and Development, LLC (related to vaccine trial case adjudication); Gilead Sciences, Inc (related to COVID-19 therapeutics); and Atea Pharmaceuticals, Inc (related to COVID-19 therapeutics). BTG reports research funding from Johns Hopkins inHealth (the Johns Hopkins Precision Medicine Initiative) and the John Templeton Foundation. All other authors declare no other conflicts of interest.

Figures

Figure 1
Missingness of clinical observations used for clustering. Clinical physiological observations associated with included patients (adults; nontransferees) with missingness of <25% in the population (N=4379) over the entire encounter. ALT: alanine transaminase; AST: aspartate aminotransferase; BUN: blood urea nitrogen; CO2: carbon dioxide; CRP: C-reactive protein; MCH: mean corpuscular hemoglobin; MCV: mean corpuscular volume; MPV: mean platelet volume; NLR: neutrophil-to-lymphocyte ratio; PLR: platelet-to-lymphocyte ratio; RBC: red blood cell count; RDW: red cell distribution width; SBP: systolic blood pressure; SFR: oxygen saturation–to–fraction of inspired oxygen ratio; SpO2: oxygen saturation; WBC: white blood cell count.
Figure 2
Heat map of correlations among clinical data used to generate clustering features showing highly uncorrelated data except for expected positive correlations between red blood cell count (RBC), hemoglobin, and hematocrit and correlations between white blood cell count (WBC) and lymphocytes or neutrophils. ALT: alanine transaminase; AST: aspartate aminotransferase; BUN: blood urea nitrogen; CO2: carbon dioxide; CRP: C-reactive protein; MCH: mean corpuscular hemoglobin; MCV: mean corpuscular volume; MPV: mean platelet volume; NLR: neutrophil-to-lymphocyte ratio; PLR: platelet-to-lymphocyte ratio; RDW: red cell distribution width; SBP: systolic blood pressure; SFR: oxygen saturation–to–fraction of inspired oxygen ratio; SpO2: oxygen saturation.
Figure 3
Weighted consensus clustering for COVID-19 phenotype identification process flow. FiO2: fraction of inspired oxygen.
Figure 4
Predictive gradient-boosting machine phenotype classifier derivation process flow. AUC: area under the curve; FiO2: fraction of inspired oxygen; GBM: gradient-boosting machine; NPV: negative predictive value; PPV: positive predictive value.
Figure 5
Demonstration that k=2 is the optimal number of clusters based on instability analysis for both the training and validation data sets.
Figure 6
Rank plots showing agreement in the most significant phenotype-defining features (eg, age, blood urea nitrogen [BUN], mean corpuscular volume [MCV], creatinine, neutrophil-to-lymphocyte ratio [NLR], red blood cell count [RBC], hemoglobin, and hematocrit) across phenotypes in both the training and validation data sets. ALT: alanine transaminase; AST: aspartate aminotransferase; CO2: carbon dioxide; CRP: C-reactive protein; MCH: mean corpuscular hemoglobin; MPV: mean platelet volume; PLR: platelet-to-lymphocyte ratio; RDW: red cell distribution width; SBP: systolic blood pressure; SFR: oxygen saturation–to–fraction of inspired oxygen ratio; SpO2: oxygen saturation; WBC: white blood cell count.
Figure 7
Violin plots of clustered features showing highly similar distributions and densities of features across phenotypes in both cohorts. ALT: alanine transaminase; AST: aspartate aminotransferase; BUN: blood urea nitrogen; CO2: carbon dioxide; CRP: C-reactive protein; MCH: mean corpuscular hemoglobin; MCV: mean corpuscular volume; MPV: mean platelet volume; NLR: neutrophil-to-lymphocyte ratio; PLR: platelet-to-lymphocyte ratio; RBC: red blood cell count; RDW: red cell distribution width; SBP: systolic blood pressure; SFR: oxygen saturation–to–fraction of inspired oxygen ratio; SpO2: oxygen saturation; WBC: white blood cell count.
Figure 8
Differences in inflammatory biomarkers across phenotypes, showing that phenotype 2 was associated with elevated hyperinflammatory biomarkers that were not used in clustering (D-dimer, ferritin, fibrinogen, interleukin 6 [IL6], lactate dehydrogenase [LDH], and procalcitonin [PCT]). CRP: C-reactive protein.
Figure 9
Adjusted odds ratios of comorbidities to clinical phenotypes showing similar associations between comorbidities and high severity (phenotype 2) of COVID-19 in both cohorts.
Figure 10
Principal-component analysis (PCA) biplot (training data) showing “good” cluster separation or spatial distribution and similar feature loading (correlations between key phenotype-defining features and principal components) with validation PCA. ALT: alanine transaminase; AST: aspartate aminotransferase; BUN: blood urea nitrogen; CO2: carbon dioxide; CRP: C-reactive protein; MCH: mean corpuscular hemoglobin; MCV: mean corpuscular volume; MPV: mean platelet volume; NLR: neutrophil-to-lymphocyte ratio; PLR: platelet-to-lymphocyte ratio; RBC: red blood cell count; RDW: red cell distribution width; Resp_rate: respiratory rate; SBP: systolic blood pressure; SFR: oxygen saturation–to–fraction of inspired oxygen ratio; SpO2: oxygen saturation; TEMP: temperature; WBC: white blood cell count.
Figure 11
Principal-component analysis (PCA) biplot (validation data) showing “good” cluster separation or spatial distribution and similar feature loading (direction or magnitude of correlations between key phenotype-defining features and principal components) with training PCA. ALT: alanine transaminase; AST: aspartate aminotransferase; BUN: blood urea nitrogen; CO2: carbon dioxide; CRP: C-reactive protein; MCH: mean corpuscular hemoglobin; MCV: mean corpuscular volume; MPV: mean platelet volume; NLR: neutrophil-to-lymphocyte ratio; PLR: platelet-to-lymphocyte ratio; RBC: red blood cell count; RDW: red cell distribution width; Resp_rate: respiratory rate; SBP: systolic blood pressure; SFR: oxygen saturation–to–fraction of inspired oxygen ratio; SpO2: oxygen saturation; TEMP: temperature; WBC: white blood cell count.
Figure 12
Survival curves for patients in phenotype 2 versus phenotype 1 (days) showing significantly lower survival in phenotype 2 versus phenotype 1 in both the training and validation cohorts.
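Survival curves such as those in Figure 12 are commonly produced with the Kaplan-Meier estimator. The abstract does not name the estimator used, so the following is a sketch under that assumption, with invented follow-up times and a hypothetical function name.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with at least one death.

    durations: follow-up time in days for each patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    records = sorted(zip(durations, events))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(records) and records[i][0] == t:
            deaths += records[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve

# Invented follow-up data (days, event flag) for two small hypothetical groups.
km_phenotype_1 = kaplan_meier([5, 8, 12, 20, 30], [0, 1, 0, 0, 0])
km_phenotype_2 = kaplan_meier([3, 6, 6, 10, 15], [1, 1, 0, 1, 0])
print("phenotype 1:", km_phenotype_1)
print("phenotype 2:", km_phenotype_2)
```

In this toy example the estimated survival for the second group falls to about 0.3 by day 10, while the first group stays at 0.75, which mirrors the direction of the difference described in the caption. A formal comparison would additionally need a test such as the log-rank test.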
