Survival prediction from imbalanced colorectal cancer dataset using hybrid sampling methods and tree-based classifiers
- PMID: 40281195
- PMCID: PMC12032297
- DOI: 10.1038/s41598-025-98703-8
Abstract
Colorectal cancer is a high-mortality cancer, with a mortality rate of 64.5% for all stages combined. Clinical data analysis plays a crucial role in predicting the survival of colorectal cancer patients, enabling clinicians to make informed treatment decisions. However, clinical data can be challenging to use, especially when outcomes are imbalanced, an aspect often overlooked in this context. This paper focuses on developing algorithms to predict 1-, 3-, and 5-year survival of colorectal cancer patients using clinical datasets, with particular emphasis on the highly imbalanced 1-year survival prediction task. We used a colorectal cancer dataset from the Surveillance, Epidemiology, and End Results (SEER) database, which is highly imbalanced for the 1-year survival analysis (1:10), imbalanced for the 3-year analysis (2:10), and balanced for the 5-year analysis. The pre-processing step consists of removing records with missing values and, within each categorical feature, merging categories with less than a 2% share to limit the number of categories per feature. Edited Nearest Neighbor, Repeated Edited Nearest Neighbor (RENN), Synthetic Minority Over-sampling Technique (SMOTE), and pipelines of SMOTE and RENN approaches were used for balancing the data with tree-based classifiers, including Decision Tree, Random Forest, Extra Tree, eXtreme Gradient Boosting, and Light Gradient Boosting Machine (LGBM). Performance was evaluated using 5-fold cross-validation. For 1-year survival, our proposed method with LGBM significantly outperforms the other sampling methods, with a sensitivity of 72.30%. For 3-year survival, the combination of RENN and LGBM achieves a sensitivity of 80.81%, indicating that our proposed method works best for highly imbalanced datasets. Additionally, when predicting 5-year survival, the sensitivity reaches 63.03% using LGBM.
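The pre-processing step described above can be illustrated with a minimal pandas sketch. It is not the authors' code: the column names, the toy data, and the choice to pool rare categories under a single "Other" label are assumptions made for illustration, since the abstract does not state how merged categories are labeled.

```python
import pandas as pd


def merge_rare_categories(df, categorical_cols, min_share=0.02, other_label="Other"):
    """Drop incomplete records, then pool rare categories per feature.

    Pooling into a single `other_label` is an assumption; the abstract
    only says that categories with less than a 2% share are merged.
    """
    # Step 1: remove records with any missing value.
    out = df.dropna().copy()
    for col in categorical_cols:
        # Relative frequency of each category among the remaining records.
        shares = out[col].value_counts(normalize=True)
        rare = shares[shares < min_share].index
        # Step 2: replace every category below the threshold with one label.
        out[col] = out[col].where(~out[col].isin(rare), other_label)
    return out


# Toy data with hypothetical column names (not the actual SEER fields).
toy = pd.DataFrame({
    "site": ["colon"] * 60 + ["rectum"] * 38 + ["appendix"] + [None],
    "grade": ["II"] * 70 + ["III"] * 29 + ["IV"],
})
clean = merge_rare_categories(toy, ["site", "grade"])
print(clean["site"].value_counts().to_dict())
```

In this toy example the single record with a missing value is dropped first, and the shares are then computed on the remaining records, so the order of the two steps affects which categories fall below the threshold.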
Our proposed method significantly improves mortality prediction for the minority class of colorectal cancer patients. RENN followed by SMOTE yields better sensitivity in the classifiers, with LGBM as the predictor performing best for 1- and 3-year survival. In the 5-year task, LGBM outperforms other models in terms of F1-score.
Keywords: Colorectal cancer; Repeated edited nearest neighbor; SEER; Survival prediction; Synthetic minority over-sampling technique.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Competing interests: The authors declare no competing interests. Human studies/informed consent: The authors carried out no human studies for this article. Animal studies: The authors carried out no animal studies for this article.