Proc Natl Acad Sci U S A. 2024 Nov 5;121(45):e2417688121.
doi: 10.1073/pnas.2417688121. Epub 2024 Oct 30.

Robust extraction of pneumonia-associated clinical states from electronic health records

Feihong Xu et al. Proc Natl Acad Sci U S A.

Abstract

Mining of electronic health records (EHR) promises to automate the identification of comprehensive disease phenotypes. However, the realization of this promise is hindered by the unavailability of generalizable ground-truth information, data incompleteness and heterogeneity, and the lack of generalization to multiple cohorts. We present here a data-driven approach to identify clinical states that we implement for 585 critical care patients with suspected pneumonia recruited by the SCRIPT study, which we compare to and integrate with 9,918 pneumonia patients from the MIMIC-IV dataset. We extract and curate from their structured EHRs a primary set of clinical features (53 and 59 features for SCRIPT and MIMIC-IV, respectively), including disease severity scores, vital signs, and so on, at various degrees of completeness. We aggregate irregular time series into daily frequency, resulting in 12,495 and 94,684 patient-day pairs for SCRIPT and MIMIC-IV, respectively. We define a "common-sense" ground truth that we then use in a semisupervised pipeline to optimize choices for data preprocessing, and reduce the feature space to four principal components. We describe and validate an ensemble-based clustering method that enables us to robustly identify five clinical states, and use a Gaussian mixture model to quantify uncertainty in cluster assignment. Demonstrating the clinical relevance of the identified states, we find that three states are strongly associated with disease outcomes (dying vs. recovering), while the other two reflect disease etiology. The outcome-associated clinical states provide significantly increased discrimination of mortality rates over standard approaches.
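
The aggregation of irregular time series into daily frequency can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the feature name `heart_rate`, the example values, and the choice of the daily mean as the aggregate are all assumptions.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean

def aggregate_daily(records):
    """Collapse irregular measurements into one value per (patient, day, feature).

    `records` is an iterable of (patient_id, iso_timestamp, feature, value)
    tuples. Both this layout and the use of the mean are assumptions made
    for illustration only.
    """
    buckets = defaultdict(list)
    for patient_id, timestamp, feature, value in records:
        day = datetime.fromisoformat(timestamp).date()
        buckets[(patient_id, day, feature)].append(value)
    # One aggregated value per patient-day and feature
    return {key: mean(values) for key, values in buckets.items()}

# Hypothetical measurements taken at irregular times
records = [
    ("p1", "2021-03-01T08:15:00", "heart_rate", 90.0),
    ("p1", "2021-03-01T20:40:00", "heart_rate", 110.0),
    ("p1", "2021-03-02T09:05:00", "heart_rate", 95.0),
]
daily = aggregate_daily(records)
print(daily[("p1", date(2021, 3, 1), "heart_rate")])
```

Each key of `daily` corresponds to one patient-day and feature; stacking the features for a given patient-day gives the kind of patient-day vector the abstract refers to.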

Keywords: EHR mining; clustering; high dimensionality; multicenter integration.

Conflict of interest statement

Competing interests statement: The authors declare no competing interest.

Figures

Fig. 1.
Illustration of study workflow. (A) Hospitalization timeline for a representative patient with one ICU stay (thin gray line), who undergoes mechanical ventilation (light yellow bar) and extracorporeal membrane oxygenation (ECMO; light red bar). Before discharge to home, the patient spends some days in the ward (thicker gray bar). The patient undergoes three broncho-alveolar lavages (BALs, purple diamonds), which yield diagnoses of community-acquired pneumonia (CAP) with viral infection (first BAL) and ventilator-associated pneumonia (VAP) with bacterial and viral coinfection (second and third BALs). (B) Illustration of data processing at the single-center level. We extract clinical features from structured EHRs. We then identify a common-sense ground truth and select preprocessing steps that balance maximizing discrimination of extreme states against minimizing data loss. We learn a low-dimensionality embedding space using PCA and use the described ensemble DPC approach to reliably and robustly identify clinical states. We associate those clinical states with patient outcomes and with disease etiology by studying transitions between clinical states. (C) To integrate multicenter cohorts, we identify common features, characterize similarities of embedding spaces, and determine the embedding that provides the richest characterization of the data.
Fig. 2.
Low-dimensionality embedding space learned from SOFA subscores and vital signs captures diversity of patient data. (A) We compare explained variance in data vs. their randomization to determine the number of significant PCs for SCRIPT (Left) and MIMIC-IV (Right) cohorts. We plot the fraction of variance explained by each PC for the data (purple solid line) and for shuffled data (blue dashed line). Error bars show 90% CI constructed by bootstrapping and the red arrow shows the last significant PC. (B) Top feature loadings of the significant PCs for SCRIPT and MIMIC-IV cohorts. Red indicates positive loadings and blue indicates negative loadings. Greater color saturation indicates larger magnitude. (C) Within-distribution and out-of-distribution performance of models for discriminating extreme states. We compute the AUC for SVM models trained on the SCRIPT training dataset, using SOFA subscores and vital signs as features, on the SCRIPT test dataset (green solid line) and MIMIC-IV dataset (green dashed line). We find outstanding performance. We also compute the AUC for SVM models trained on the MIMIC-IV training dataset, using SOFA subscores and vital signs as features, on the MIMIC-IV test dataset (orange solid line) and on the SCRIPT dataset (orange dashed line). We find good but lower performance. (D) Projections of combined distributions of patient-day vectors onto embedding spaces learned from the SCRIPT training dataset. It is visually apparent that the two cohorts have different characteristics.
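
The comparison in panel A, explained variance of the data against that of shuffled data, can be sketched as below. This is an illustration under stated assumptions rather than the authors' procedure: the per-column permutation, the 95th-percentile threshold, the function names, and the synthetic data are all hypothetical, and the caption's bootstrapped 90% CI is not reproduced here.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance explained by each principal component."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to PC variances
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

def n_significant_pcs(X, n_shuffles=100, quantile=0.95, seed=0):
    """Count leading PCs that explain more variance than shuffled data does."""
    rng = np.random.default_rng(seed)
    observed = explained_variance_ratio(X)
    null = np.empty((n_shuffles, observed.size))
    for i in range(n_shuffles):
        # Permute each column independently to destroy correlations between features
        shuffled = np.column_stack([rng.permutation(col) for col in X.T])
        null[i] = explained_variance_ratio(shuffled)
    threshold = np.quantile(null, quantile, axis=0)
    significant = observed > threshold
    if significant.all():
        return observed.size
    # Index of the first PC that fails equals the number of leading significant PCs
    return int(np.argmin(significant))

# Synthetic data: six features driven by two independent latent factors plus noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 2))
W = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]], dtype=float)
X = latent @ W + 0.3 * rng.normal(size=(500, 6))
print(n_significant_pcs(X))
```

Shuffling each column separately keeps every feature's marginal distribution while removing the correlations between features, so any PC that still explains more variance than the shuffled baseline reflects shared structure in the data.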
Fig. 3.
Clinical states are associated with patient outcomes and disease etiology. (A) GMM trained on the SCRIPT training dataset (Right panel). Cluster membership of patient-day vectors for the SCRIPT testing dataset (Middle panel) and the MIMIC-IV cohort (Left panel). (B) Proportion of patient-day vectors classified into each of the five clinical states for patients stratified by patient outcome and disease etiology. Note the strong association of clusters C1, C2, and C5 with patient outcomes and of cluster C4 with COVID-19.
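
How a Gaussian mixture model quantifies uncertainty in cluster assignment can be illustrated with a one-dimensional toy mixture. The sketch below is not the fitted model from the study: the mixture weights, means, and standard deviations are hypothetical, and the use of Shannon entropy as the uncertainty summary is an assumption made here for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component given the point x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

def assignment_entropy(probs):
    """Shannon entropy in bits; 0 means the assignment is certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Hypothetical two-component mixture (parameters chosen for illustration)
weights, means, stds = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]

near_center = responsibilities(-2.0, weights, means, stds)  # at a component mean
on_boundary = responsibilities(0.0, weights, means, stds)   # midway between means

print(near_center, assignment_entropy(near_center))
print(on_boundary, assignment_entropy(on_boundary))
```

A point at a component mean is assigned almost entirely to that component, while a point midway between two equally weighted components is split evenly, which is the sense in which the posterior probabilities express assignment uncertainty.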
Fig. 4.
Clinical states yield greater discriminatory power of patient outcomes than SOFA scores at short time horizons and provide earlier insight into disease etiology. (Top row) Next-day mortality rates for patients with patient-day vectors in clusters C1, C2, or C5, or with SOFA scores in quintiles Q1, Q2, or Q5. Numbers over bars show the ratio of mortality rates between two groups. Across all three datasets, clinical states provide greater stratification of patient mortality than SOFA scores. (Bottom row) Percentage of patients with a COVID-19 episode diagnosis (see SI Appendix, Text for details) for patients with a patient-day vector in clusters C3 or C4, or with SOFA scores in quintiles Q3 or Q4. Numbers over bars show the ratio of COVID-19 diagnosis rates between two groups. Patients diagnosed with COVID-19 are overrepresented in cluster C4.

