Therap Adv Gastroenterol. 2025 Aug 12;18:17562848251362391.
doi: 10.1177/17562848251362391. eCollection 2025.

Identifying inflammatory bowel disease subtypes: a comprehensive exploration of transcriptomic data and machine learning-based approaches


Niyati Saini et al. Therap Adv Gastroenterol.

Abstract

Background: Inflammatory bowel disease (IBD), encompassing Crohn's disease (CD) and ulcerative colitis (UC), is a heterogeneous condition characterised by chronic gastrointestinal inflammation and dysregulated immune responses. Despite advances in transcriptomic analysis and machine learning (ML), consistent molecular subtyping across datasets remains a challenge. There is a critical need for robust subtypes that reflect disease heterogeneity and correlate with clinical outcomes.

Objectives: Unlike prior studies focused on either UC or CD or based on small datasets, this study analyses a large-scale RNA sequencing (RNA-seq) dataset to identify transcriptomic subtypes in both UC and CD.

Design: We analysed RNA-seq data from four prospective cross-sectional cohorts from the Gene Expression Omnibus (GEO): GSE193677, GSE186507, GSE137344 and GSE235236.

Methods: We analysed RNA-seq data from inflamed and non-inflamed intestinal biopsies of 2490 adult IBD patients. K-means clustering was applied independently to UC and CD samples to identify transcriptomic clusters. Gene set enrichment and network analyses were used to explore the molecular characteristics of each cluster. Associations with clinical metadata, including disease severity and anatomical involvement, were assessed using Chi-square and analysis of variance tests.
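As a rough illustration of the clustering step, the sketch below runs plain K-means (Lloyd's algorithm) for several values of k and records the within-cluster sum of squares (WCSS), the quantity an elbow plot displays. It is not the authors' pipeline: the toy 2D points stand in for expression profiles, the helper names (`sq_dist`, `farthest_first`, `kmeans`) are hypothetical, and the deterministic farthest-first seeding is an assumption made only so the example is reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding (an assumption, not the study's method):
    start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(nxt)
    return [list(c) for c in centroids]


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, wcss)."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

# WCSS for k = 1..3; an elbow plot would show these values against k.
wcss_by_k = {k: kmeans(points, k)[2] for k in (1, 2, 3)}
labels3 = kmeans(points, 3)[0]
```

On data like this, WCSS drops sharply up to the true number of groups and flattens afterwards, which is the pattern an elbow plot is read for. On real expression data the drop is usually less clear-cut, and results depend on preprocessing and initialisation.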

Results: K-means clustering revealed three distinct transcriptomic subtypes in both UC and CD. In UC, Cluster 1 was enriched for RNA processing and DNA repair genes; Cluster 2 highlighted autophagy, stress responses and upregulation of ATG13, VPS37C and DVL2; Cluster 3 emphasised cytoskeletal organisation (SRF, SRC and ABL1). In CD, Cluster 1 featured cytoskeletal remodelling and suppressed protein synthesis (CFL1, F11R and RAD23A), while Cluster 2 upregulated stress and translation pathways. Cluster 3 again prioritised cytoskeletal structure over metabolic activity. Cluster 3 in both conditions was significantly associated with moderate-to-severe endoscopic activity; Cluster 1 was enriched in inactive or mild disease.
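The association between cluster membership and a categorical clinical variable, such as endoscopic activity, can be illustrated with a Chi-square test of independence. The sketch below computes the statistic by hand on an invented 3 x 2 table; the counts are hypothetical and are not taken from the study, and the function names are assumptions. The closed-form p-value helper is valid only for even degrees of freedom; in practice a library routine such as `scipy.stats.chi2_contingency` would typically be used instead.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_sf_even_dof(stat, dof):
    """Upper-tail probability of the chi-square distribution.
    Closed form that holds only when dof is even."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even dof")
    half = stat / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(dof // 2))
    return math.exp(-half) * terms


# Hypothetical counts (NOT study data).
# Rows: Cluster 1, Cluster 2, Cluster 3.
# Columns: inactive or mild, moderate-to-severe.
toy_table = [
    [30, 10],
    [20, 20],
    [10, 30],
]

stat, dof = chi_square_independence(toy_table)
p_value = chi2_sf_even_dof(stat, dof)
```

A small p-value here would indicate only that cluster and severity category are not independent in the toy table; it says nothing about the direction or size of the effect.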

Conclusion: We report three transcriptomic subtypes in UC and CD, each with distinct molecular signatures and clinical relevance. These findings support a stratified approach to IBD diagnosis and therapy, enabling more personalised disease management strategies.

Keywords: Crohn’s disease; IBD subtypes; machine learning; transcriptomics; ulcerative colitis.

Plain language summary

Identification of IBD subtypes using machine learning

Inflammatory bowel disease (IBD) is a complex gastrointestinal disorder affecting millions worldwide. This study analyzed RNA sequencing data from 2490 adult patients, revealing three distinct molecular subtypes for both ulcerative colitis and Crohn’s disease. By examining gene expression patterns in intestinal biopsies, researchers identified clusters characterized by different cellular processes, such as RNA processing, cytoskeletal dynamics and stress responses. Each subtype showed specific gene upregulation and a distinct molecular signature. The research used K-means clustering and statistical analysis to link these subtypes with disease severity and the intestinal region involved. This approach provides deeper insight into IBD’s molecular mechanisms, potentially paving the way for more personalized treatment strategies.


Figures

Figure 1.
Workflow of the data analysis, showing the steps and methodologies adopted in each phase. CD, Crohn’s disease; GEO, Gene Expression Omnibus; GO, gene ontology; IBD, inflammatory bowel disease; KEGG, Kyoto Encyclopaedia of Genes and Genomes; ML, machine learning; n, number of samples; UC, ulcerative colitis; WGCNA, weighted gene co-expression network analysis.
Figure 2.
Results from K value selection and clustering analysis. (a, b) Elbow plot showing k = 3 as the appropriate K-value for UC and CD datasets, respectively. (c, d) PCA plot of K-means clustering showing three clusters within UC and CD samples, respectively. (a, b) X-axis represents the number of clusters (k), Y-axis depicts WCSS and the red dotted line indicates the optimum value of k. (c, d) X-axis and Y-axis show PC1 and PC2, respectively, with variance in %. CD, Crohn’s disease; PCA, principal component analysis; UC, ulcerative colitis; WCSS, within-cluster sum of squares.
Figure 3.
Differential gene expression analysis on UC and CD samples with adjusted p-value < 0.001 and Log2FC = 1. (a, b) Enhanced volcano plots of CD Cluster 3 and UC Cluster 3 showing statistically significant genes. (c) Venn diagram showing common and unique significant genes in UC and CD samples across the clusters in the primary dataset. (d) Venn diagram showing common and unique significant genes in both the primary and validation datasets across all UC and CD clusters. CD, Crohn’s disease; log2FC, log 2 fold change; UC, ulcerative colitis.
Figure 4.
Visualisation of enriched pathways in UC and CD clusters. (a, b) Dot plot showing the top 5 enriched upregulated Reactome and KEGG pathways, respectively, in UC Cluster 3. Each bubble represents a pathway, with its size indicating the number of genes involved. (c) Network plot of the top 5 enriched upregulated GO pathways in CD Cluster 3. Red nodes represent pathways, while blue nodes represent genes. Edges (lines) denote interactions or associations between pathways and genes. (d) Combined bar plot illustrating the top enriched upregulated GO pathways across UC and CD clusters. CD, Crohn’s disease; GO, gene ontology; KEGG, Kyoto Encyclopaedia of Genes and Genomes; log2FC, log 2 fold change; UC, ulcerative colitis.
Figure 5.
A cluster dendrogram illustrating the arrangement of clusters of genes produced by WGCNA. On the bottom, the ‘Dynamic Tree Cut’ and ‘Merged Dynamics’ show the module colour assigned to the branch. WGCNA, weighted gene co-expression network analysis.
Figure 6.
Characterisation of clusters in the primary dataset shows (1) the characteristics of the cluster and (2) the potential therapeutic targets for that cluster.
Figure 7.
Heatmap of genes from the top 5 upregulated GO pathways in UC (primary dataset). This heatmap visualises the expression levels of genes from the top 5 upregulated GO pathways in UC clusters within the primary dataset. Samples are annotated by clusters and clinical features, including gender, age, endoscopic severity and region. The plot highlights the differential expression patterns of genes across clusters, emphasising their roles in upregulated GO pathways. GO, gene ontology; UC, ulcerative colitis.
Figure 8.
Heatmap of genes from the top 5 downregulated GO pathways in UC (primary dataset). This heatmap visualises the expression levels of genes from the top 5 downregulated GO pathways in UC clusters within the primary dataset. Samples are annotated by clusters and clinical features, including gender, age, endoscopic severity and region. The plot highlights the differential expression patterns of genes across clusters, emphasising their roles in downregulated GO pathways. GO, gene ontology; UC, ulcerative colitis.
