Sci Rep. 2025 Apr 25;15(1):14554.
doi: 10.1038/s41598-025-98703-8.

Survival prediction from imbalanced colorectal cancer dataset using hybrid sampling methods and tree-based classifiers

Sadegh Soleimani et al. Sci Rep. 2025.

Abstract

Colorectal cancer is a high-mortality cancer, with a mortality rate of 64.5% for all stages combined. Clinical data analysis plays a crucial role in predicting the survival of colorectal cancer patients, enabling clinicians to make informed treatment decisions. However, clinical data can be difficult to use, especially when outcomes are imbalanced, an aspect often overlooked in this context. This paper develops algorithms to predict 1-, 3-, and 5-year survival of colorectal cancer patients from clinical datasets, with particular emphasis on the highly imbalanced 1-year survival prediction task. We used a colorectal cancer dataset from the Surveillance, Epidemiology, and End Results (SEER) database, which is highly imbalanced in the 1-year analysis (1:10), imbalanced in the 3-year analysis (2:10), and balanced in the 5-year analysis. Pre-processing consists of removing records with missing values and, for each categorical feature, merging categories with less than a 2% share to limit the number of categories per feature. Edited Nearest Neighbor (ENN), Repeated Edited Nearest Neighbor (RENN), Synthetic Minority Over-sampling Technique (SMOTE), and pipelines combining SMOTE and RENN were used to balance the data, together with tree-based classifiers: Decision Tree, Random Forest, Extra Tree, eXtreme Gradient Boosting, and Light Gradient Boosting Machine (LGBM). Performance was evaluated with 5-fold cross-validation. For 1-year survival, our proposed method with LGBM significantly outperforms the other sampling methods, with a sensitivity of 72.30%. For 3-year survival, the combination of RENN and LGBM achieves a sensitivity of 80.81%, indicating that our proposed method works best for highly imbalanced datasets. For 5-year survival, the sensitivity reaches 63.03% using LGBM.
Our proposed method significantly improves mortality prediction for the minority class of colorectal cancer patients. RENN followed by SMOTE yields better sensitivity across the classifiers, with LGBM performing best as the predictor for 1- and 3-year survival. In the 5-year task, LGBM outperforms the other models in terms of F1-score.
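To make the "RENN followed by SMOTE" idea concrete, here is a minimal pure-Python sketch on toy 2-D data with a 1:10 imbalance. It is not the authors' Algorithm 1, which is not reproduced on this page. The function names, the majority-vote removal rule, the neighbor counts (k=3 for cleaning, k=5 for interpolation), and the choice to oversample until the classes are equal are all assumptions made for illustration. In practice a maintained library such as imbalanced-learn would normally be used instead.

```python
import math
import random
from collections import Counter


def k_nearest(i, points, candidates, k):
    """Indices of the k candidates closest to points[i] (Euclidean), excluding i."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def renn(X, y, majority, k=3, max_iter=100):
    """Repeated ENN: keep dropping majority samples that a k-NN vote misclassifies."""
    keep = list(range(len(X)))
    for _ in range(max_iter):
        drop = set()
        for i in keep:
            if y[i] != majority:
                continue  # assumption: only the majority class is cleaned
            neighbours = k_nearest(i, X, keep, k)
            if not neighbours:
                continue
            votes = Counter(y[j] for j in neighbours)
            if votes.most_common(1)[0][0] != majority:
                drop.add(i)
        if not drop:
            break  # converged: a full pass removed nothing
        keep = [i for i in keep if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote(X, y, minority, n_new, k=5, rng=None):
    """Create n_new synthetic minority points by interpolating towards a near neighbour."""
    rng = rng or random.Random(0)
    idx = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(idx)
        j = rng.choice(k_nearest(i, X, idx, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
    return X + synthetic, y + [minority] * n_new


def renn_then_smote(X, y, minority=1, majority=0, seed=0):
    """Hybrid sampler: clean the majority class first, then oversample the minority."""
    X_clean, y_clean = renn(X, y, majority)
    counts = Counter(y_clean)
    n_new = max(0, counts[majority] - counts[minority])
    return smote(X_clean, y_clean, minority, n_new, rng=random.Random(seed))


# Toy data: 200 majority points around (0, 0), 20 minority points around (2, 2).
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
X += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(20)]
y = [0] * 200 + [1] * 20

X_res, y_res = renn_then_smote(X, y)
print(Counter(y), "->", Counter(y_res))
```

Running the cleaning step first means SMOTE interpolates in a neighborhood that has already had overlapping majority points removed. When such a sampler is combined with cross-validation, it is usually applied to the training folds only, so that the held-out fold keeps its original class ratio; whether the paper does this is not stated in the abstract.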

Keywords: Colorectal cancer; Repeated edited nearest neighbor; SEER; Survival prediction; Synthetic minority over-sampling technique.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests. Human studies/informed consent: The authors carried out no human studies for this article. Animal studies: The authors carried out no animal studies for this article.

Figures

Fig. 1. Block diagram representation of the proposed framework for colorectal cancer survival prediction.
Fig. 2. Histogram of patient survival times (in months).
Fig. 3. Correlogram of features in the processed dataset.
Algorithm 1. Hybrid RENN + SMOTE Sampling
Fig. 4. Comparison of tailored sampling methods including ENN, RENN, SMOTE, RENN + SMOTE, and SMOTE + RENN techniques.
Fig. 5. Sensitivities (left axis) and F1-score (right axis) of all tasks for the classifiers with no sampling method.
Fig. 6. Sensitivity and F1-score of 1-year survival for each classifier using No-sampling and 3 top samplers, including SMOTE, RENN, and RENN + SMOTE techniques.
Fig. 7. Sensitivity and F1-score of 3-year survival for each classifier using No-sampling and 3 top samplers, including SMOTE, RENN, and RENN + SMOTE techniques.
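
The figures report sensitivity and F1-score rather than accuracy. The short sketch below shows how both are computed from true and predicted labels, and why accuracy is misleading at a 1:10 class ratio. It assumes the minority (positive) class is labeled 1; the function name and the toy labels are hypothetical and are not taken from the paper.

```python
def sensitivity_and_f1(y_true, y_pred, positive=1):
    """Return (sensitivity, F1) for the positive class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return sensitivity, f1


# Small example: 3 true positives, 1 false negative, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
sens, f1 = sensitivity_and_f1(y_true, y_pred)
print(sens, f1)  # 0.75 0.75

# A classifier that always predicts the majority class on 1:10 data
# is about 91% accurate yet finds no minority cases at all.
imb_true = [1] * 1 + [0] * 10
imb_pred = [0] * 11
base_acc = sum(t == p for t, p in zip(imb_true, imb_pred)) / len(imb_true)
base_sens, base_f1 = sensitivity_and_f1(imb_true, imb_pred)
print(round(base_acc, 3), base_sens, base_f1)
```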
