[Preprint]. 2025 Jun 4:2025.06.02.25328786.
doi: 10.1101/2025.06.02.25328786.

Machine Learning Analysis of Electronic Health Records Identifies Interstitial Lung Disease and Predicts Mortality in Patients with Systemic Sclerosis

Alec K Peltekian et al. medRxiv.

Abstract

Background: Interstitial lung disease (ILD) is the leading cause of death in patients with systemic sclerosis (SSc), affecting more than 40% of this population. Despite the availability of effective treatments to stabilize or improve lung function, survival for patients with SSc-ILD remains poor. Poor outcomes have been attributed to delayed diagnosis and delayed initiation of treatment for SSc-ILD. Although recent guidelines have provided conditional recommendations for early screening, pulmonary function tests (PFTs) are insensitive for early diagnosis, and computed tomography (CT), the current gold standard, often detects disease only after irreversible lung injury has occurred. A single sensitive biomarker that can accurately predict the risk of SSc-ILD development and mortality is lacking. We hypothesized that applying machine learning (ML) methods to multiple features from readily available electronic health records (EHR) could construct a model to detect ILD and predict mortality in patients with SSc.

Methods: We retrospectively analyzed EHR data from participants enrolled in a single-center registry of patients with SSc over a period of twenty-eight years (1995-2024). We applied a combination of ML models to seventy-four clinical features encompassing demographics, clinical history, PFTs, and laboratory results. The resultant models were tasked with detecting ILD and predicting mortality in participants with SSc.

Results: 1,169 participants with SSc were included in this study, spanning 15,494 person-years of observation. Models detecting ILD achieved an AUC of 0.818 and confirmed the importance of known biomarkers, such as autoantibodies and PFTs, as risk factors for SSc-ILD. Unexpected clinical values including white blood cell count and mean corpuscular volume were also important for model prediction of SSc-ILD. For prediction of one-year all-cause mortality, models reached an AUC of 0.903. In a subgroup analysis of those with prevalent radiographic SSc-ILD, three-year all-cause mortality prediction reached an AUC of 0.831. These models identified features strongly associated with mortality that are routinely collected during clinical assessment of patients with SSc, including unexpected associations with values such as red cell distribution width and serum chloride concentration.

Conclusions: ML-based analysis of clinical features and laboratory tests collected as part of routine clinical care detects ILD and predicts mortality in patients with SSc.

Figures

Figure 1. Labeling strategy and modeling tasks for ILD detection, mortality prediction in all SSc participants, and mortality prediction in participants with SSc-ILD.
(A): ILD Detection. Displays the timeline for ILD detection based on CT scan results. Participants with no evidence of ILD are labeled as “0,” transitioning to “−1” during uncertain periods (e.g., after a negative CT and before the CT establishing the ILD diagnosis) and “1” after ILD is confirmed by CT. Each green marker represents a CT scan without ILD, while red markers represent CTs confirming ILD. (B): Mortality Prediction in all SSc participants. Labels are determined by proximity to the death/lung transplant event, represented by the brown vertical line. If the end of a yearly bin falls within the specified prediction window from the time of death/transplant, the label is “1” (blue); otherwise, it is “0” (green) (e.g., Death_3 = participant died within 3 years of the end of the current yearly bin). (C): Mortality Prediction in participants with SSc-ILD. Focuses on SSc-ILD participants, where mortality prediction begins one year before ILD diagnosis (marked in red). Annual prediction bins are used. As in Panel B, the brown line marks the time of death/lung transplant. Labels are “1” (blue) if the end of a bin falls within the corresponding prediction window from the death/transplant event, and “0” (green) otherwise.
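The labeling rules in this caption can be sketched in a few lines of Python. This is an illustrative reading of the caption, not the authors' code: the function names `ild_label` and `mortality_label`, the use of years as the time unit, and the handling of edge cases the caption does not describe (participants with no positive CT, or no negative CT before the first positive one) are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def ild_label(t: float, cts: Sequence[Tuple[float, bool]]) -> int:
    """Label time t given (ct_time, shows_ild) pairs, following Panel A.

    Returns 1 once ILD is confirmed by CT, 0 while there is no evidence
    of ILD, and -1 in the uncertain gap between the last negative CT and
    the first positive CT.
    """
    positives = [ct_time for ct_time, shows_ild in cts if shows_ild]
    if not positives:
        # Assumption: with no positive CT on record, treat as "no evidence".
        return 0
    first_positive = min(positives)
    if t >= first_positive:
        return 1
    negatives = [ct_time for ct_time, shows_ild in cts
                 if not shows_ild and ct_time < first_positive]
    if negatives and t <= max(negatives):
        return 0
    # Assumption: anything before the first positive CT that is not
    # covered by an earlier negative CT is uncertain.
    return -1


def mortality_label(bin_end: float, event_time: Optional[float],
                    window: float) -> int:
    """Label a yearly bin following Panels B and C.

    Returns 1 if death/lung transplant occurs within `window` years after
    the end of the bin, else 0. `event_time` is None if no event occurred.
    """
    if event_time is None:
        return 0
    return int(0 <= event_time - bin_end <= window)


# Hypothetical participant: negative CTs at years 1 and 3, ILD on CT at year 6.
scans = [(1.0, False), (3.0, False), (6.0, True)]
timeline = [ild_label(t, scans) for t in (2.0, 4.5, 6.0)]
```

For the hypothetical participant above, years 2.0, 4.5, and 6.0 fall in the "no evidence", "uncertain", and "confirmed" periods respectively. A label such as Death_3 corresponds to calling `mortality_label` with `window=3`.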
Figure 2. Longitudinal characteristics of the Northwestern Scleroderma Registry cohort.
(A) Distribution of participant follow-up duration binned by number of years. (B) Number of clinical encounters for the aggregate Registry cohort over time. (C) Cumulative number of active participants (blue), lost to follow-up (orange), and death (red) over time. (D) Age at SSc diagnosis subgrouped by sex. (E) Kaplan-Meier survival curves comparing SSc participants with and without ILD within the CT Subgroup cohort. NOTE: Of the CT Subgroup cohort (n=709), 9 participants were excluded from the survival analysis due to missing non-Raynaud’s onset data.
Figure 3. Distribution of ILD diagnosis by expert adjudication of radiologic reports in SSc participants in the Northwestern University Scleroderma Registry.
Figure 4. Hierarchical clustering of SSc participants reveals distinct clinical and ILD-associated phenotypic subgroups.
Each vertical column represents an individual participant of the 1,169 participants with SSc in the full cohort. The top section includes participant demographics and clinical characteristics, which were not employed in cluster analysis. The bottom section displays hierarchical clustering (clusters=10) results.
Figure 5. Model performance and feature importance in ILD detection.
Performance of the LightGBM model in ILD detection. The ROC curve (A) shows strong discriminative ability (AUC 0.818). SHAP analysis (B) identifies key features of ILD detection.
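For readers less familiar with the metric, the AUC reported here can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes it directly from that definition on made-up labels and scores; it is a generic illustration, not the authors' evaluation pipeline, and the function name `roc_auc` is hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (not from the study): two negatives, two positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

The pairwise form is quadratic in the number of cases, so it is only practical for small examples; library implementations use a sort-based equivalent.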
Figure 6. Model performance and feature importance in mortality prediction.
ROC curves (A) for mortality prediction over one-, three-, and five-year intervals show declining performance with longer time horizons. Feature importance plots (B, C, D) highlight shifting predictors from laboratory values in the short term to chronic disease markers over longer periods.
Figure 7. Model performance and feature importance in mortality prediction for participants with SSc-ILD.
ROC curves (A) for mortality prediction in participants with confirmed SSc-ILD over one, three, and five years show distinct patterns compared to mortality in the general SSc cohort. Feature importance plots (B, C, D) reveal evolving predictors, from vital signs and labs at one year to demographics and chronic disease markers by five years.

