Am J Respir Crit Care Med. 2024 Aug 15;210(4):444-454.
doi: 10.1164/rccm.202309-1692OC.

Machine Learning of Plasma Proteomics Classifies Diagnosis of Interstitial Lung Disease

Yong Huang et al. Am J Respir Crit Care Med.

Abstract

Rationale: Distinguishing connective tissue disease-associated interstitial lung disease (CTD-ILD) from idiopathic pulmonary fibrosis (IPF) can be clinically challenging.

Objectives: To identify proteins that separate and classify patients with CTD-ILD and those with IPF.

Methods: Four registries with 1,247 patients with IPF and 352 patients with CTD-ILD were included in analyses. Plasma samples were subjected to high-throughput proteomics assays. Protein features were prioritized using recursive feature elimination to construct a proteomic classifier. Multiple machine learning models, including support vector machine, LASSO (least absolute shrinkage and selection operator) regression, random forest, and imbalanced random forest, were trained and tested in independent cohorts. The validated models were used to classify each case iteratively in external datasets.

Measurements and Main Results: A classifier with 37 proteins (proteomic classifier 37 [PC37]) was enriched in the biological processes of bronchiole development, smooth muscle proliferation, and immune responses. Four machine learning models used PC37 with sex and age scores to generate continuous classification values. Receiver operating characteristic curve analyses of these scores demonstrated consistent areas under the curve of 0.85-0.90 in the test cohort and 0.94-0.96 in the single-sample dataset. Binary classification demonstrated 78.6-80.4% sensitivity and 76-84.4% specificity in the test cohort and 93.5-96.1% sensitivity and 69.5-77.6% specificity in the single-sample classification dataset. Composite analysis of all machine learning models confirmed 78.2% (194 of 248) accuracy in the test cohort and 82.9% (208 of 251) in the single-sample classification dataset.

Conclusions: Multiple machine learning models trained with large cohort proteomic datasets consistently distinguished CTD-ILD from IPF. Many of the identified proteins are involved in immune pathways. We further developed a novel approach for single-sample classification, which could facilitate honing the differential diagnosis of ILD in challenging cases and improve clinical decision making.
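The performance figures in the abstract (accuracy, sensitivity, specificity, and area under the curve) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the toy labels and scores, the 0.5 cutoff, and the choice of IPF as the positive class are all assumptions made for illustration; only the counts 194 of 248 and 208 of 251 come from the abstract.

```python
def percent(correct, total):
    """Share of correctly classified cases, as a percentage."""
    return 100.0 * correct / total


def sensitivity_specificity(y_true, y_pred, positive="IPF"):
    """Return (sensitivity, specificity), treating `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores, positive="IPF"):
    """AUC as the probability that a random positive case outscores a random
    negative case (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Accuracy figures quoted in the abstract.
print(round(percent(194, 248), 1))  # 78.2
print(round(percent(208, 251), 1))  # 82.9

# Toy data, invented for illustration only (not from the study).
toy_truth = ["IPF", "IPF", "IPF", "CTD-ILD", "CTD-ILD"]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2]  # hypothetical continuous classification values
toy_calls = ["IPF" if s >= 0.5 else "CTD-ILD" for s in toy_scores]  # assumed 0.5 cutoff

toy_sens, toy_spec = sensitivity_specificity(toy_truth, toy_calls)
toy_auc = roc_auc(toy_truth, toy_scores)
```

Sensitivity and specificity depend on the cutoff used to turn a continuous score into a binary call, whereas the AUC summarizes the continuous scores across all cutoffs, which is why the abstract reports the two kinds of result separately.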

Keywords: connective tissue disease with ILD; differential diagnosis; idiopathic pulmonary fibrosis; machine learning model; plasma proteomics.

Figures

Figure 1.
Diagram for proteomics data processing and machine learning procedure. (A) PFF Patient Registry training cohort. Samples were filtered on the basis of clinical diagnosis and technical variation. The proteomic classifier was prioritized using RFE. (B) Machine learning classification for UVA/UChicago test cohort (left panel) and single sample in RECITAL/UC-Davis datasets (right panel). AUC = area under the curve; CTD-ILD = connective tissue disease–associated interstitial lung disease; IPF = idiopathic pulmonary fibrosis; LASSO = least absolute shrinkage and selection operator; PCA = principal component analysis; PFF = Pulmonary Fibrosis Foundation; ROC = receiver operating characteristic; SVM = support vector machine; UC-Davis = University of California, Davis; UChicago = University of Chicago; UVA = University of Virginia.
Figure 2.
Differential protein levels and pathways between connective tissue disease–associated interstitial lung disease (CTD-ILD) and idiopathic pulmonary fibrosis (IPF) in the Pulmonary Fibrosis Foundation (PFF) training cohort. (A) Volcano plot of t test results. Colored dots represent the top significant proteins higher in IPF (right) or CTD-ILD (left), respectively. Log fold changes (FCs; x-axis) and log inverse of Benjamini-Hochberg–adjusted P values (y-axis) of t test results were computed from the average of 100 times random subsampling of the PFF cohort as described in the Methods section. (B) Gene Set Enrichment Analysis (GSEA) of pathways that are activated or suppressed in IPF compared to CTD-ILD.
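The volcano plot axes described in this caption can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, the base-10 logarithm for the y-axis and the IPF minus CTD-ILD direction of the fold change are assumptions, and the averaging over 100 random subsamples described in the Methods is not reproduced here. The Benjamini-Hochberg step-up adjustment itself is the standard procedure.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_point(mean_log2_ipf, mean_log2_ctd, adjusted_p):
    """(x, y) for one protein: log2 fold change and -log10 of the adjusted P value.

    On a log2 scale, the log fold change is the difference in group means.
    """
    return mean_log2_ipf - mean_log2_ctd, -math.log10(adjusted_p)


# Toy P values, invented for illustration only.
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_adjusted = benjamini_hochberg(toy_p)
toy_x, toy_y = volcano_point(5.0, 4.0, 0.01)
```

With this sign convention, proteins higher in IPF fall on the right of the plot and proteins higher in CTD-ILD on the left, matching the caption.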
Figure 3.
Prioritization and characterization of the proteomic classifier. (A) A recursive feature elimination (RFE) procedure was performed to prioritize the optimized protein features in 100 random subsamplings of the Pulmonary Fibrosis Foundation cohort (Table E5). In this example subsampling for RFE, the optimal size was 138 proteins with 91% accuracy in repeated cross-validation. (B) Partial effects of the selected protein features on IPF probabilities in the training cohort. x-Axis: Olink assay NPX values (relative protein quantification units on a log2 scale). (See the other 31 of 37 features in Figure E4.) (C) Variable importance (VIMP) plot, with confidence intervals (CIs), of proteomic classifier 37 (PC37) with sex and age scores in the training cohort. Bars represent 95% CIs. Red and blue bars represent lower CI boundaries above or below 0, respectively. IPF = idiopathic pulmonary fibrosis.
Figure 4.
Performance of machine learning (ML) models for disease classification. ML classification supervised by proteomic classifier 37 (PC37) with sex and age scores. (A and B) Classification of UVA/UChicago test cohort. (C and D) Classification of RECITAL/University of California, Davis (UC-Davis), samples. (A and C) Binary classification by four ML models. (B and D) Receiver operating characteristic curve analysis of the continuous classification values. (E) Proportion of patients with connective tissue disease–associated interstitial lung disease (CTD-ILD) or patients with idiopathic pulmonary fibrosis (IPF) with different composite diagnosis scores (CDSs). CDS = 0 or 1, CTD-ILD; CDS = 2, unclassified patients; CDS = 3 or 4, IPF. (F) Decision curve analysis of RECITAL/UC-Davis samples to compare ML models with sex and age. All models and composite classification surpassed sex from 0 to 100% probability and age when probability was >50%. AUC = area under the curve; LASSO = least absolute shrinkage and selection operator; RF = random forest; SVM = support vector machine; UChicago = University of Chicago; UVA = University of Virginia.
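The composite diagnosis score (CDS) thresholds in panel E can be written out as a short Python sketch. The caption gives only the mapping from CDS to class (0 or 1, CTD-ILD; 2, unclassified; 3 or 4, IPF). Reading the CDS as the number of the four models that call IPF is an assumption here, and the function and variable names are hypothetical rather than taken from the paper.

```python
MODELS = ("SVM", "LASSO", "RF", "imbalanced RF")


def composite_diagnosis_score(calls):
    """Count how many of the four models call IPF.

    `calls` maps each model name to "IPF" or "CTD-ILD".
    Assumption: the CDS is the number of IPF votes.
    """
    missing = [m for m in MODELS if m not in calls]
    if missing:
        raise ValueError(f"missing model calls: {missing}")
    return sum(1 for m in MODELS if calls[m] == "IPF")


def interpret_cds(cds):
    """Map a CDS to a class using the thresholds given in the caption."""
    if cds in (0, 1):
        return "CTD-ILD"
    if cds == 2:
        return "unclassified"
    if cds in (3, 4):
        return "IPF"
    raise ValueError(f"CDS must be between 0 and 4, got {cds}")


# Toy example, invented for illustration: three of four models call IPF.
toy_model_calls = {"SVM": "IPF", "LASSO": "IPF", "RF": "CTD-ILD", "imbalanced RF": "IPF"}
toy_cds = composite_diagnosis_score(toy_model_calls)
toy_label = interpret_cds(toy_cds)
```

Leaving a tie (CDS = 2) unclassified rather than forcing a call means the composite only commits to a diagnosis when a majority of models agree.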
