[Preprint]. 2024 Nov 13:2024.11.13.24316854.
doi: 10.1101/2024.11.13.24316854.

Improving Differentiation of Crohn's Disease and Ulcerative Colitis Proteomes through Protein-Wide Association Study Feature Selection in Machine Learning

Mark G Gorelik et al. medRxiv.

Abstract

Background and aims: Diagnostic differentiation between Crohn's disease (CD) and ulcerative colitis (UC) is crucial for timely and suitable therapeutic measures. The current gold standard for differentiating between CD and UC involves endoscopy and histology, which are invasive and costly. We aimed to identify blood plasma proteomic signatures using a Protein-Wide Association Study (PWAS) approach to differentiate CD from UC and evaluate the efficacy of these signatures as features in machine learning (ML) classifiers.

Methods: Among participants (n=1,106; nCD=636; nUC=470) of the Study of a Prospective Adult Research Cohort with IBD (SPARC), levels of 2,920 plasma proteins were estimated using Olink proteomics. A PWAS with Bonferroni correction for multiple testing was used to identify proteins associated with disease states after controlling for age, sex, and disease severity. ML classifiers were then used to examine the diagnostic utility of the identified proteins. Feature importance was determined via SHapley Additive exPlanations (SHAP) analysis.

Results: Thirteen proteins were significantly differentially abundant in CD vs UC (all |β| > 0.22, all adjusted p values < 8.42E-06). Random forest models differentiated between CD and UC, and models trained only on PWAS-identified proteins (average ROC-AUC 0.73) outperformed models trained on the full proteome (average ROC-AUC 0.62). SHAP analysis revealed that Granzyme B, insulin-like peptide 5 (INSL5), and interleukin-12 subunit beta (IL-12B) were the most important features.

Conclusions: Our findings demonstrate that PWAS-based feature selection is a powerful method for identifying informative features in complex, noisy datasets. Importantly, we have identified novel peptide-based biomarkers, such as INSL5, that could potentially be used to complement existing strategies for differentiating between CD and UC.

Keywords: IBD; Machine Learning; PWAS.

Conflict of interest statement

P.D. has received research support under a sponsored research agreement unrelated to the data in the paper and/or consulting from AbbVie, Arena Pharmaceuticals, Boehringer Ingelheim, Bristol Myers Squibb, Janssen, Pfizer, Prometheus Biosciences, Takeda Pharmaceuticals, Roche Genentech, Scipher Medicine, Fresenius Kabi, Teva Pharmaceuticals, Landos Pharmaceuticals, Iterative Scopes, and CorEvitas, LLC. U.J. has received research support from Boehringer Ingelheim.

Figures

Figure 1. Sample processing and analysis pipeline.
Blood plasma samples were collected and processed as described in the methods and materials. Differentially abundant proteins were identified in the PWAS analysis. Protein abundance was used as features for the machine learning models to classify CD from UC.
Figure 2. PWAS analysis enables separation of the proteomic profiles of ulcerative colitis and Crohn’s disease.
A) Principal Component Analysis (PCA) of the global proteomics profiles of Crohn’s disease and ulcerative colitis. B) Volcano plot where the x-axis is the calculated beta and the y-axis is the negative log10 of the unadjusted p-value; green and labeled points had a Bonferroni-adjusted p-value of less than 0.0000171 (used in Fig 1B), orange points had an FDR-adjusted p-value of less than .05, and purple points represent proteins with a p-value > .05. Negative beta values are associated with Crohn’s disease and positive beta values are associated with ulcerative colitis. C) PCA of the proteomics profile identified by the PWAS analysis. Ellipses represent 95% confidence bounds around group centroids.
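The Bonferroni cutoff quoted in this caption is consistent with dividing a family-wise error level of 0.05 by the 2,920 proteins measured. The short Python sketch below works through that arithmetic. The 0.05 level, the function names, and the example p-values are assumptions made for illustration and are not taken from the preprint.

```python
from typing import Dict, List, Optional


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def bonferroni_significant(
    p_values: Dict[str, float],
    alpha: float = 0.05,
    n_tests: Optional[int] = None,
) -> List[str]:
    """Return the names whose unadjusted p-value falls below the cutoff.

    n_tests defaults to the number of p-values supplied; pass it explicitly
    when only a subset of the tested features is given.
    """
    m = len(p_values) if n_tests is None else n_tests
    cutoff = bonferroni_threshold(alpha, m)
    return sorted(name for name, p in p_values.items() if p < cutoff)


# 2,920 proteins were measured; alpha = 0.05 is an assumed family-wise level.
cutoff = bonferroni_threshold(0.05, 2920)
print(f"{cutoff:.3g}")  # 1.71e-05

# Hypothetical unadjusted p-values, invented purely to show the filter.
example = {"protein_a": 1e-6, "protein_b": 1e-3}
print(bonferroni_significant(example, n_tests=2920))  # ['protein_a']
```

Comparing unadjusted p-values against alpha divided by the number of tests is equivalent to multiplying each p-value by the number of tests and comparing against alpha.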
Figure 3. Specific proteins improve machine learning based differentiation of CD and UC.
A) Effect of feature set on model accuracy. B) Effect of feature set on model sensitivity. C) Effect of feature set on model specificity. D) Effect of feature set on model ROC-AUC. ***P<.001, ANOVA with Tukey’s post hoc test.
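For readers less familiar with the four metrics named in this caption, the following Python sketch computes each of them from first principles. It is an illustration only: treating CD as the positive class (1) and UC as the negative class (0) is an arbitrary assumption, the toy labels and scores are invented, and the function names are hypothetical rather than taken from the authors' code.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in sample size,
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = CD, 0 = UC (an arbitrary labeling choice).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

print(classification_metrics(y_true, y_pred))
print(round(roc_auc(y_true, scores), 3))  # 0.889
```

Accuracy, sensitivity, and specificity depend on the chosen decision threshold, whereas ROC-AUC summarizes ranking quality across all thresholds, which is one reason the two kinds of metric can move differently between feature sets.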
Figure 4. Clinical features do not improve model performance.
A) SHAP beeswarm plot of the validation dataset indicating feature importance in random forest models trained on patient-associated features (age, sex, disease severity) and the thirteen proteins that are significantly associated with Crohn’s disease and ulcerative colitis. B) SHAP beeswarm plot of the validation dataset indicating feature importance in random forest models trained on only the thirteen proteins that are significantly associated with Crohn’s disease and ulcerative colitis. Features are sorted in descending order of importance.
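SHAP values are based on Shapley values from cooperative game theory. As a rough illustration of the idea behind these plots, the Python sketch below computes exact Shapley values by brute force for a toy model. This is not the authors' pipeline, and it is not the tree-specific algorithm normally applied to random forests; the toy linear model, the baseline-replacement scheme for "absent" features, and all names and values are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature.

    A feature that is "absent" from a coalition takes its baseline value.
    The cost grows exponentially with the number of features, so this is
    only practical for a handful of them.
    """
    n = len(x)

    def value(coalition: frozenset) -> float:
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # size of the coalition that excludes feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model with three features and made-up coefficients.
def toy_model(z: List[float]) -> float:
    return 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(v, 6) for v in phi])  # [2.0, 6.0, -3.0]
```

For a linear model each attribution reduces to the coefficient times the feature's offset from the baseline, and the attributions sum to the difference between the prediction and the baseline prediction. A beeswarm plot shows one such attribution per feature per sample.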
