Age-stratified analysis of therapeutic, immune, and glycosylation gene expression in colorectal cancer using machine learning
- PMID: 41390349
- PMCID: PMC12804678
- DOI: 10.1038/s41598-025-31499-9
Abstract
Colorectal cancer (CRC) is a major global health issue, yet current treatment strategies rarely consider patient age differences, leading to variable therapeutic efficacy and clinical outcomes. Although numerous biomarkers for CRC have been identified, their age-specific expression profiles and biological implications remain poorly understood. This knowledge gap limits the development and clinical deployment of age-tailored interventions. In this study, we applied an age-aware machine learning framework to uncover gene signatures stratified by age using the GSE44076 microarray dataset. We analyzed three CRC-relevant gene categories (Therapeutic, Immune, and Glycosylation) across three data versions: Original, WithAge (age as a feature), and Age Regressed (residual expression). Patient samples were stratified into younger (< 70 years) and older (≥ 70 years) cohorts to identify age-influenced molecular shifts. Random Forest (RF)-based feature selection yielded compact gene signatures that discriminated among tumor, normal, and mucosa samples. Under 5 × 10 nested cross-validation, Top-10 gene models achieved Balanced Accuracy ≈ 0.94-0.99 and macro-averaged one-vs-rest (OvR) AUC ≈ 0.96-0.99, while Top-3 sets retained strong performance (Balanced Accuracy ≈ 0.87-0.95). We benchmarked RF against Gradient Boosting Machine (GBM), Support Vector Machine (SVM), and k-Top Scoring Pairs (KTSP) classifiers. While RF provided the best overall multivariate metrics under shallow-tree constraints (depth 1-2; min leaf 1-2), KTSP models consistently captured age-dependent gene ranking shifts with enhanced sensitivity, especially in external validation. KTSP's robustness in identifying directional age effects was most evident in the Glycosylation category, where glycan-processing genes showed pronounced age-stratified performance. Permutation tests that replicated the full nested pipeline (N = 100) yielded p ≈ 0.01, confirming that results are unlikely under the null.
To uncover deeper regulatory mechanisms, we modeled gene-age interactions, revealing additional biomarkers. External validation using the GSE106582 cohort further supported our findings, although batch harmonization constrained gene coverage by removing platform-unique probes, highlighting a key limitation for microarray-based biomarker translation. Nonetheless, core signatures retained their predictive power, confirming generalizability. Altogether, our results establish age as a critical biological variable influencing CRC gene expression and classifier performance. This study provides a rigorous framework for integrating machine learning, statistical interaction modeling, and biological annotation to guide age-stratified biomarker discovery. Our findings support the development of precision oncology tools tailored to the distinct molecular landscapes of young-onset and late-onset CRC.
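The three data versions and the permutation p-value mentioned in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not state how age was regressed out (a per-gene linear least-squares fit is assumed), the permutation estimator (b + 1)/(N + 1) is a common convention rather than the authors' confirmed choice, the helper names `age_regressed` and `permutation_p_value` are hypothetical, and the data are synthetic.

```python
import numpy as np

def age_regressed(expr, age):
    """Return per-gene residuals after a linear fit on age (assumed method)."""
    age = np.asarray(age, dtype=float)
    # Design matrix: intercept column plus age.
    X = np.column_stack([np.ones_like(age), age])
    # One least-squares fit per gene (column of expr), solved jointly.
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return expr - X @ beta

def permutation_p_value(observed, null_scores):
    """(b + 1) / (N + 1), where b counts null scores >= the observed score."""
    null_scores = np.asarray(null_scores, dtype=float)
    b = int(np.sum(null_scores >= observed))
    return (b + 1) / (null_scores.size + 1)

# Synthetic stand-in for an expression matrix (samples x genes); not real data.
rng = np.random.default_rng(0)
age = rng.integers(40, 90, size=60)
expr = rng.normal(size=(60, 5)) + 0.05 * age[:, None]

original = expr                             # "Original"
with_age = np.column_stack([expr, age])     # "WithAge": age appended as a feature
residual = age_regressed(expr, age)         # "Age Regressed": residual expression

# Cohort split used in the abstract: younger (< 70) vs. older (>= 70).
older = age >= 70
younger = ~older

# With N = 100 permutations and no null score reaching the observed one,
# this estimator gives 1/101, the smallest p-value it can report.
p_min = permutation_p_value(0.95, np.zeros(100))
```

Under this estimator, 100 permutations cannot yield a p-value below 1/101 (about 0.0099), which is consistent with the reported p ≈ 0.01.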
Keywords: Age-stratified analysis; Colorectal cancer; Glycosylation; Immune genes; Machine learning; Therapeutic genes.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Competing interests: The authors declare no competing interests.
