Sci Rep. 2025 Dec 13;16(1):1865.
doi: 10.1038/s41598-025-31499-9.

Age-stratified analysis of therapeutic, immune, and glycosylation gene expression in colorectal cancer using machine learning

Hakan Celik et al. Sci Rep.

Abstract

Colorectal cancer (CRC) is a major global health issue, yet current treatment strategies rarely consider patient age differences, leading to variable therapeutic efficacy and clinical outcomes. Although numerous biomarkers for CRC have been identified, their age-specific expression profiles and biological implications remain poorly understood. This knowledge gap limits the development and clinical deployment of age-tailored interventions. In this study, we applied an age-aware machine learning framework to uncover gene signatures stratified by age using the GSE44076 microarray dataset. We analyzed three CRC-relevant gene categories (Therapeutic, Immune, and Glycosylation) across three data versions: Original, WithAge (age as a feature), and Age Regressed (residual expression). Patient samples were stratified into younger (< 70 years) and older (≥ 70 years) cohorts to identify age-influenced molecular shifts. Random Forest (RF)-based feature selection yielded compact gene signatures that discriminated among tumor, normal, and mucosa samples. Under 5-fold × 10-repeat nested cross-validation, Top-10 gene models achieved Balanced Accuracy ≈ 0.94-0.99 and macro-averaged OvR AUC ≈ 0.96-0.99, while Top-3 sets retained strong performance (Balanced Accuracy ≈ 0.87-0.95). We benchmarked RF against Gradient Boosting Machine (GBM), Support Vector Machine (SVM), and k-Top Scoring Pairs (KTSP) classifiers. While RF provided the best overall multivariate metrics under shallow-tree constraints (depth 1-2; min leaf 1-2), KTSP models consistently captured age-dependent gene ranking shifts with enhanced sensitivity, especially in external validation. KTSP's robustness in identifying directional age effects was most evident in the Glycosylation category, where glycan-processing genes showed pronounced age-stratified performance. Permutation tests that replicated the full nested pipeline (N = 100) yielded p ≈ 0.01, confirming that results are unlikely under the null.
To uncover deeper regulatory mechanisms, we modeled gene-age interactions, revealing additional biomarkers. External validation using the GSE106582 cohort further supported our findings, although batch harmonization constrained gene coverage by removing platform-unique probes, highlighting a key limitation for microarray-based biomarker translation. Nonetheless, core signatures retained their predictive power, confirming generalizability. Altogether, our results establish age as a critical biological variable influencing CRC gene expression and classifier performance. This study provides a rigorous framework for integrating machine learning, statistical interaction modeling, and biological annotation to guide age-stratified biomarker discovery. Our findings support the development of precision oncology tools tailored to the distinct molecular landscapes of young-onset and late-onset CRC.
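
The "Age Regressed (residual expression)" data version and the < 70 / ≥ 70 split described above can be illustrated with a minimal sketch. It assumes a simple per-gene ordinary least squares fit of expression on age; the abstract does not state the exact regression model, and the function names and toy values below are hypothetical.

```python
from statistics import mean

def age_regress(expression, ages):
    """Residuals of one gene's expression after a linear fit on age.

    Assumption: a plain least-squares line per gene; the source does not
    specify the regression model used.
    """
    age_mean = mean(ages)
    expr_mean = mean(expression)
    sxx = sum((a - age_mean) ** 2 for a in ages)
    sxy = sum((a - age_mean) * (e - expr_mean) for a, e in zip(ages, expression))
    slope = sxy / sxx if sxx else 0.0
    intercept = expr_mean - slope * age_mean
    return [e - (intercept + slope * a) for a, e in zip(ages, expression)]

def stratify_by_age(ages, cutoff=70):
    """Label each sample 'young' (< cutoff) or 'old' (>= cutoff)."""
    return ["young" if a < cutoff else "old" for a in ages]

# Toy values for illustration only (not taken from GSE44076).
ages = [55, 62, 68, 71, 77, 83]
gene = [5.1, 5.4, 5.9, 6.0, 6.6, 6.8]
residuals = age_regress(gene, ages)
groups = stratify_by_age(ages)
```

By construction the residuals sum to zero and are uncorrelated with age, so a classifier trained on them cannot exploit a linear age trend in that gene.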

Keywords: Age-stratified analysis; Colorectal cancer; Glycosylation; Immune genes; Machine learning; Therapeutic genes.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Flow chart. This flow chart illustrates the overall methodology. After loading and verifying the GSE44076 dataset, we stratified samples by age and filtered for specific functional gene sets (Therapeutic, Immune, Glycosylation). We then transformed the data for machine learning and trained Random Forest models as the primary classifier (with SVM and GBM models run in parallel for comparison), validating performance through cross-validation and permutation testing. Finally, we selected top genes based on RF feature importance and generated volcano plots to visualize age-related differential expression.
Fig. 2
Age distribution and PCA analysis of colorectal cancer datasets (GSE44076 and GSE106582). (A) Age distribution of samples in raw GSE44076 stratified by sample type (Mucosa, Normal, Tumor). (B) Principal Component Analysis (PCA) of raw GSE44076 expression matrix, colored by sample condition. PC1 (16.1%) clearly separates mucosa/normal samples from tumor samples, while PC2 (8.0%) shows minor age-associated variation (point size reflects age in years). (C) PCA of raw GSE106582 dataset demonstrates a similar separation pattern between tumor and mucosa conditions along PC1 (16.1%), with slight dispersion across PC2 (8.0%) and a modest age gradient. (D) Age distribution in raw GSE106582, stratified by sample type, again revealing comparable age distributions between mucosa (normal) and tumor groups, centered around the seventh decade of life. (E) PCA of the merged, batch‐corrected dataset (GSE44076 + GSE106582) colored by sample condition. The batch effect has been successfully removed, as indicated by clear clustering by biological condition rather than by dataset of origin. (F) PCA of the merged batch‐corrected dataset colored by batch (GSE44076 vs. GSE106582). Samples from both datasets are interspersed across the PCA space, confirming that ComBat effectively mitigated batch effects. Point size in all PCA plots reflects sample age (smaller = younger, larger = older).
Fig. 3
Top 10 Differentially Expressed Genes (Tumor) Across Therapeutic, Glycosylation, and Immune Categories by Age Group. Heatmaps show tumor expression levels of the top 10 differentially expressed genes in each functional category (Therapeutic, Glycosylation, and Immune), stratified by age group (Young vs. Old). They highlight differences in gene expression between younger (< 70) and older (≥ 70) colorectal cancer patients, illustrating potential age-related transcriptional variation across these pathways. (A) Therapeutic genes (tumor). (B) Glycosylation genes (tumor). (C) Immune genes (tumor).
Fig. 4
Age-stratified feature importance, differential expression, learning curves, and permutation tests for the Therapeutic gene set. (A, B) Top 10 Random Forest (RF) features ranked by impurity-based importance in the Young (< 70) and Old (≥ 70) age strata. Gene symbols correspond to RF-selected probes with the highest predictive contribution. (C, D) Volcano plots showing age-associated differential expression (log2 fold-change Old/Young vs. -log10(p)) across the transcriptome within each age group. The top 10 RF-selected genes are highlighted. (E, F) Learning curves for RF models trained on the top three features (RF Top-3) in each age stratum. Curves show mean balanced accuracy across outer folds (orange, ± SD) versus training set size, alongside training performance (blue), based on fivefold × 10-repeat nested stratified cross-validation. (G, H) Permutation testing results for the RF Top-3 models in each stratum. Histograms display the null distribution of balanced accuracies obtained from 100 label-shuffled permutations, with the observed score (red dashed line) plotted for comparison. P-values were calculated as the fraction of permuted accuracies equal to or exceeding the observed value, confirming statistical significance.
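
The permutation p-value described for panels G and H can be sketched as follows. The plain fraction matches the caption's wording; the add-one variant is an assumption on our part, shown because it would give p ≈ 0.01 when none of 100 permuted scores reaches the observed value. The function name and the toy scores are hypothetical.

```python
def permutation_p_value(observed, null_scores, add_one=True):
    """Fraction of permuted scores that are >= the observed score.

    add_one=True applies the (count + 1) / (N + 1) correction, which is an
    assumption here and not confirmed by the source.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    if add_one:
        return (exceed + 1) / (len(null_scores) + 1)
    return exceed / len(null_scores)

# Toy null distribution of 100 balanced accuracies near chance for 3 classes.
null_scores = [0.30 + 0.001 * i for i in range(100)]
observed = 0.93  # illustrative value, not a result from the paper

p_corrected = permutation_p_value(observed, null_scores)
p_raw = permutation_p_value(observed, null_scores, add_one=False)
```

With no permuted score reaching the observed one, the raw fraction is exactly 0, while the corrected estimate is 1/101, the smallest value that 100 permutations can resolve.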
Fig. 5
Age‐stratified classification performance: confusion matrices for RF and KTSP models (Therapeutic gene set). Confusion matrices show the distribution of correctly and incorrectly classified samples across three sample types (Mucosa, Normal, Tumor) in the young and old age groups. Panels A-F show the Random Forest models trained with the top 10 and top 3 features and the multiclass KTSP model, respectively. Color intensity reflects absolute sample counts, with darker tiles representing higher counts.
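
Balanced accuracy, the headline metric in this study, can be read directly off a confusion matrix like those shown here: it is the mean of the per-class recalls. The sketch below assumes rows are true classes and columns are predicted classes; the counts are made up for illustration and are not taken from the figure.

```python
def balanced_accuracy(confusion):
    """Mean per-class recall from a square confusion matrix.

    Rows are true classes, columns are predicted classes (an assumed layout).
    Classes with no true samples are skipped.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total:
            recalls.append(row[i] / total)
    return sum(recalls) / len(recalls)

# Illustrative counts only; class order: Mucosa, Normal, Tumor.
confusion = [
    [18, 2, 0],
    [1, 9, 0],
    [0, 0, 30],
]
score = balanced_accuracy(confusion)
```

Because every class contributes equally regardless of its size, a large and easily separated tumor group cannot mask errors between the smaller mucosa and normal groups.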
Fig. 6
External validation of age-stratified classifiers using independent GSE106582 dataset. (A) Confusion matrix of the Young (< 70) RF Top-10 classifier applied to the batch-corrected GSE106582 dataset. (B) Confusion matrix of the Old (≥ 70) RF Top-10 classifier applied to the same dataset. (C) Macro-averaged ROC curve of the Young RF Top-10 model. (D) Macro-averaged ROC curve of the Old RF Top-10 model. (E) Confusion matrix of the Young KTSP classifier applied to GSE106582. (F) Confusion matrix of the Old KTSP classifier. (G) Macro-averaged ROC curve of the Young KTSP model. (H) Macro-averaged ROC curve of the Old KTSP model. Dashed orange diagonal indicates performance of a random classifier (AUC = 0.5).
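
The KTSP classifiers validated here rest on within-sample rank comparisons between gene pairs. A minimal two-class sketch of the standard top-scoring-pair score is given below; the study uses a multiclass KTSP whose details are not given in this excerpt, so this is an illustration of the underlying idea under that assumption. Function names and the toy data are hypothetical.

```python
from itertools import combinations

def tsp_score(samples_a, samples_b, i, j):
    """|P(gene_i < gene_j | class A) - P(gene_i < gene_j | class B)|."""
    p_a = sum(1 for s in samples_a if s[i] < s[j]) / len(samples_a)
    p_b = sum(1 for s in samples_b if s[i] < s[j]) / len(samples_b)
    return abs(p_a - p_b)

def top_k_pairs(samples_a, samples_b, k):
    """Return the k gene-index pairs with the highest TSP score."""
    n_genes = len(samples_a[0])
    scored = [
        (tsp_score(samples_a, samples_b, i, j), (i, j))
        for i, j in combinations(range(n_genes), 2)
    ]
    scored.sort(key=lambda item: -item[0])  # stable sort keeps tie order
    return [pair for _, pair in scored[:k]]

# Toy expression rows (samples x 3 genes); values are illustrative only.
tumor = [[1, 5, 3], [2, 6, 9], [1, 4, 0]]
normal = [[5, 1, 3], [6, 2, 9], [4, 1, 0]]
best = top_k_pairs(tumor, normal, 1)
```

Since only the ordering of two genes within a sample matters, such rules are insensitive to monotone shifts in measurement scale, which is one commonly cited reason rank-based classifiers tend to transfer across cohorts.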
Fig. 7
Identification of age-dependent biomarkers in GSE44076. (A) Top 50 most important Therapeutic genes identified by a Random Forest trained on raw GSE44076 expression values with continuous Age included as an interaction term. Red bars correspond to interaction features (Gene × Age), indicating an age-dependent effect. (B) Top 50 most important Immune genes under the same model, with red bars indicating age-dependent interaction features. (C) Top 50 most important Glycosylation genes, again highlighting interaction features (Gene × Age) in red. Feature importance was computed across all genes and interaction terms, and plotted using the mean decrease in Gini impurity.
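
The Gene × Age interaction features referred to in this caption can be built by multiplying each gene's expression by the sample's age and appending the products to the feature matrix. The sketch below assumes a simple product term without centering or scaling, which the caption does not specify; the gene names and values are hypothetical.

```python
def add_age_interactions(expression_rows, ages, gene_names):
    """Append one Gene × Age product column per gene to each sample row.

    Assumption: raw products of expression and continuous age, with no
    centering or scaling.
    """
    names = list(gene_names) + [f"{g} × Age" for g in gene_names]
    rows = []
    for row, age in zip(expression_rows, ages):
        rows.append(list(row) + [value * age for value in row])
    return names, rows

# Toy data for illustration only.
gene_names = ["GENE_A", "GENE_B"]
expression_rows = [[2.0, 3.0], [4.0, 1.0]]
ages = [60, 75]
feature_names, features = add_age_interactions(expression_rows, ages, gene_names)
```

A model trained on the widened matrix can then rank main-effect and interaction columns together, which is what allows interaction terms to appear among the top features.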
Fig. 8
Comparison of Classifier Performance Across Gene Categories and Age Groups. Bar plots display the top 10 predictive genes selected by Random Forest (RF), Gradient Boosting Machine (GBM), and Support Vector Machine (SVM) classifiers based on feature importance rankings in each gene category and age group. Models were trained separately for younger (< 70 years) and older (≥ 70 years) cohorts, using the top 10 genes derived from each classifier. Panels are grouped by classifier and gene category: (A, B) Therapeutic genes in young and old patients (GBM); (C, D) Therapeutic genes in young and old patients (SVM); (E, F) Immune genes in young and old patients (GBM); (G, H) Immune genes in young and old patients (SVM); (I, J) Glycosylation genes in young and old patients (GBM); (K, L) Glycosylation genes in young and old patients (SVM). The x-axis indicates the normalized feature importance score, and the y-axis lists gene symbols. Feature importance values reflect the relative contribution of each gene to the classifier’s prediction performance, highlighting variability in gene prioritization across classifiers and age-defined CRC subgroups.
Fig. 9
Circular bar plots showing the top enriched pathways for the Therapeutic (A), Immune (B), and Glycosylation (C) gene sets, based on differentially expressed genes from the tumor group only. Each bar represents a pathway, with bar height corresponding to pathway significance, and pathway names placed around the circle. The gene symbols that overlap with each enriched pathway are shown on the outer ring, rotated for clarity.
