Risk factors and prediction of distant metastasis (DM) of colon adenocarcinoma: a logistic regression and machine learning study based on surveillance, epidemiology, and end results (SEER) database
- PMID: 40597951
- PMCID: PMC12211135
- DOI: 10.1186/s12885-025-14329-z
Abstract
Background: Because traditional imaging examinations have limitations in detecting distant metastasis (DM), such as low sensitivity, this study aimed to identify pathological and laboratory risk factors and to establish models predicting DM in colon adenocarcinoma (CA) patients.
Methods: CA patients diagnosed between 2018 and 2021 were retrieved from SEER. Logistic regression was used to identify independent risk factors (IRFs) of DM. Twelve models were established and evaluated on a training set and a test set (7:3 split) of the retrieved patients: BNB (Bernoulli naïve Bayes), DT (Decision tree), GBC (Gradient Boosting Classifier), GNB (Gaussian naïve Bayes), KNN (K-nearest neighbor), LDA (Linear Discriminant Analysis), LR (Logistic regression), MLP (Multi-layer perceptron classifier), MNB (Multinomial naïve Bayes), QDA (Quadratic discriminant analysis), RFC (Random forest classifier) and SVC (Support vector machine). Additionally, CA patient data were collected from Jincheng People’s Hospital (JCPH) as an external validation set to assess the predictive efficacy of the models.
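The 7:3 partition described in the Methods can be illustrated with a minimal sketch. The abstract does not say how the split was drawn (for example, whether it was stratified by DM status or which random seed was used), so the plain shuffled split, the function name `split_7_3`, and the seed below are assumptions for illustration only, not the authors' procedure.

```python
import random

def split_7_3(records, seed=0):
    """Shuffle a copy of `records` and split it 70% / 30% (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(records)           # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10      # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Stand-in patient identifiers; 7,000 matches the SEER cohort size in the abstract.
patients = list(range(7000))
train, test = split_7_3(patients)
print(len(train), len(test))  # prints "4900 2100"
```

With an outcome as imbalanced as DM may be, a stratified split is a common alternative; whether the study used one is not stated.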
Results: In total, 7,000 CA patients were retrieved from SEER and 83 from JCPH. Eight IRFs were identified: age 60–79 (OR = 0.589, 95% CI: 0.391–0.887) and age > 80 (OR = 0.456, 95% CI: 0.287–0.722); primary site in the cecum (OR = 1.305, 95% CI: 1.023–1.664); TNM stage T3 (OR = 8.869, 95% CI: 2.151–36.569) and T4 (OR = 15.912, 95% CI: 3.839–65.955); TNM stage N1 (OR = 3.853, 95% CI: 2.919–5.087) and N2 (OR = 8.480, 95% CI: 6.322–11.374); number of regional nodes examined > 12 (OR = 0.439, 95% CI: 0.326–0.591); tumor deposits (OR = 1.989, 95% CI: 1.639–2.414); carcinoembryonic antigen (CEA) level (OR = 4.552, 95% CI: 3.747–5.530); and perineural invasion (OR = 1.352, 95% CI: 1.112–1.643). LR showed the best predictive efficacy on both the test set (AUC = 0.892, sensitivity = 0.825, specificity = 0.801) and the external validation set (AUC = 0.868, sensitivity = 1.000, specificity = 0.727).
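For readers less familiar with how odds ratios relate to logistic regression output, the sketch below shows the usual conversion: OR = exp(β), with a 95% interval of exp(β ± z·SE). The abstract does not state how the intervals were computed; the code assumes Wald intervals, and the function names are hypothetical. It recovers an approximate coefficient and standard error from the reported CEA figures and then rebuilds the interval as a consistency check.

```python
import math

# Standard normal quantile for a two-sided 95% interval.
Z_95 = 1.959963984540054

def odds_ratio_with_ci(beta, se, z=Z_95):
    """Return (OR, lower, upper) for a logistic regression coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coefficient_from_reported(or_value, lower, upper, z=Z_95):
    """Invert a Wald interval: recover (beta, se) from a reported OR and CI.

    Assumes the interval is symmetric on the log-odds scale.
    """
    beta = math.log(or_value)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# CEA level as reported in the abstract: OR = 4.552, 95% CI 3.747-5.530.
beta, se = coefficient_from_reported(4.552, 3.747, 5.530)
or_value, lower, upper = odds_ratio_with_ci(beta, se)
print(f"beta={beta:.3f} se={se:.3f} OR={or_value:.3f} CI=({lower:.3f}, {upper:.3f})")
```

The rebuilt interval agrees with the reported one to within rounding, which is consistent with (but does not prove) a Wald-type interval. Values below 1, such as the OR for more than 12 regional nodes examined, correspond to negative coefficients.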
Conclusions: Machine learning is a promising way to assist in detecting DM in CA patients.
Keywords: Colon adenocarcinoma; Distant metastasis; Machine learning; Risk factor.
Conflict of interest statement
Declarations. Ethics approval and consent to participate: This retrospective study was conducted in accordance with the ethical standards of the Declaration of Helsinki and was approved by the Institutional Review Board (IRB) of Jincheng People’s Hospital (JCPH) (Approval No. 20250314003). Given the retrospective nature of the research, which utilized anonymized clinical data from the SEER database and JCPH, the IRB waived the requirement to obtain informed consent from participants. This decision was based on the following considerations: (1) the study involved no more than minimal risk to participants, (2) the waiver would not adversely affect the rights and welfare of the participants, (3) the research could not practicably be carried out without the waiver, and (4) the study did not involve any procedures for which written consent is normally required outside of the research context. All data were de-identified to ensure patient confidentiality and privacy. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.