BMC Cancer. 2025 Jul 1;25(1):1047.
doi: 10.1186/s12885-025-14329-z.

Risk factors and prediction of distant metastasis (DM) of colon adenocarcinoma: a logistic regression and machine learning study based on surveillance, epidemiology, and end results (SEER) database

Qiang Guo et al. BMC Cancer.

Abstract

Background: Given the limitations of traditional imaging examinations in detecting distant metastasis (DM) (e.g., low sensitivity), this study aimed to identify pathological and laboratory risk factors and to establish models predicting DM in colon adenocarcinoma (CA) patients.

Methods: CA patients diagnosed between 2018 and 2021 were retrieved from SEER. Logistic regression was used to find independent risk factors (IRFs) of DM, and 12 models, including BNB (Bernoulli naïve Bayes), DT (Decision tree), GBC (Gradient Boosting Classifier), GNB (Gaussian naïve Bayes), KNN (K-nearest neighbor), LDA (Linear Discriminant Analysis), LR (Logistic regression), MLP (Multi-layer perceptron classifier), MNB (Multinomial naïve Bayes), QDA (Quadratic discriminant analysis), RFC (Random forest classifier) and SVC (Support vector machine), were established and evaluated on a training set and a test set (7:3 split) of the retrieved patients. Additionally, CA patient data were collected from Jincheng People’s Hospital (JCPH) as an external validation set for the prediction efficacy of the models.
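The split-and-evaluate step described in the Methods can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the helper names, the random seed, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the abstract does not state whether the split was stratified.

```python
import random

def split_7_3(n_rows, seed=0):
    """Shuffle row indices and split them 70% / 30% (hypothetical helper)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = n_rows * 7 // 10  # integer arithmetic avoids float rounding
    return idx[:cut], idx[cut:]

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(y_true, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# 7,000 rows is the SEER cohort size reported in the abstract.
train_idx, test_idx = split_7_3(7000)

# Toy labels (1 = DM) and predicted probabilities, invented for illustration.
y_toy = [1, 1, 1, 0, 0, 0, 0, 1]
s_toy = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
toy_auc = auc(y_toy, s_toy)
toy_sens, toy_spec = sensitivity_specificity(y_toy, s_toy)
```

With a 7:3 split, 7,000 patients yield 4,900 training rows and 2,100 test rows. Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.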

Results: A total of 7,000 and 83 CA patients were retrieved from SEER and JCPH, respectively. Eight IRFs were identified: age 60–79 (OR = 0.589, 95% CI: 0.391–0.887) and age > 80 (OR = 0.456, 95% CI: 0.287–0.722); primary site – cecum (OR = 1.305, 95% CI: 1.023–1.664); TNM stage – T3 (OR = 8.869, 95% CI: 2.151–36.569) and T4 (OR = 15.912, 95% CI: 3.839–65.955); TNM stage – N1 (OR = 3.853, 95% CI: 2.919–5.087) and N2 (OR = 8.480, 95% CI: 6.322–11.374); number of regional nodes examined > 12 (OR = 0.439, 95% CI: 0.326–0.591); tumor deposits (OR = 1.989, 95% CI: 1.639–2.414); carcinoembryonic antigen (CEA) level (OR = 4.552, 95% CI: 3.747–5.530); and perineural invasion (OR = 1.352, 95% CI: 1.112–1.643). LR showed the best predictive efficacy on both the test set (AUC = 0.892, sensitivity = 0.825, specificity = 0.801) and the external validation set (AUC = 0.868, sensitivity = 1.000, specificity = 0.727).
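To make the reported odds ratios easier to read, the sketch below converts an OR and its 95% CI back to a logistic regression coefficient and standard error. It assumes the intervals are Wald intervals, exp(β ± 1.96·SE), which the abstract does not state; the function names are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval

def or_to_coef(odds_ratio, ci_low, ci_high):
    """Recover (beta, SE) on the log-odds scale, assuming a Wald interval."""
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return beta, se

def coef_to_or(beta, se):
    """Map (beta, SE) back to (OR, lower bound, upper bound)."""
    return (math.exp(beta),
            math.exp(beta - Z_95 * se),
            math.exp(beta + Z_95 * se))

# CEA level as reported in the abstract: OR = 4.552, 95% CI 3.747-5.530.
cea_beta, cea_se = or_to_coef(4.552, 3.747, 5.530)
cea_or, cea_low, cea_high = coef_to_or(cea_beta, cea_se)

# Age 60-79 as reported: OR = 0.589, so the coefficient is negative.
age_beta, age_se = or_to_coef(0.589, 0.391, 0.887)
```

An OR above 1 corresponds to a positive coefficient (higher odds of DM) and an OR below 1 to a negative one. If the round trip reproduces the published bounds to rounding precision, the Wald assumption is at least consistent with the reported figures.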

Conclusions: Machine learning is a promising way to assist in detecting DM in CA patients.

Keywords: Colon adenocarcinoma; Distant metastasis; Machine learning; Risk factor.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: This retrospective study was conducted in accordance with the ethical standards of the Declaration of Helsinki and was approved by the Institutional Review Board (IRB) of Jincheng People’s Hospital (JCPH) (Approval No. 20250314003). Given the retrospective nature of the research, which utilized anonymized clinical data from the SEER database and JCPH, the IRB granted a waiver of the requirement to obtain informed consent from participants. This decision was based on the following considerations: (1) the study involved no more than minimal risk to participants, (2) the waiver would not adversely affect the rights and welfare of the participants, (3) the research could not practicably be carried out without the waiver, and (4) the study did not involve any procedures for which written consent is normally required outside of the research context. All data were de-identified to ensure patient confidentiality and privacy. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Flow chart of data retrieval (A). Spearman analysis of the IRFs (B). Cross-validation evaluation of the AUCs of the models on the training set (C). Variance inflation factor analysis of the IRFs (D). *SD = standard deviation
Fig. 2
ROCs of the LR, GBC and MLP on the training set (A), test set (B) and external validation set (C). SHAP plot for the LR on the test set (D): Each point on the summary plot is the Shapley value for one IRF and one instance. All IRFs are in descending order of importance along the vertical axis (from top to bottom). The colors represent the values of the IRFs from low (blue) to high (red). A point to the right of the vertical axis (i.e., a positive Shapley value) means the IRF increases the probability of predicting M1
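For a linear model such as LR, Shapley values have a simple closed form on the log-odds scale when features are treated as independent: each feature contributes its coefficient times its deviation from the background mean. The sketch below illustrates that identity only. The coefficients, intercept, background means, and patient vector are invented for illustration, and the caption does not say which SHAP explainer the authors used.

```python
def log_odds(intercept, coefs, x):
    """Linear predictor of a logistic regression model."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))

def linear_shap(coefs, x, background_means):
    """Shapley values for a linear model under feature independence:
    phi_i = beta_i * (x_i - mean_i), expressed in log-odds units."""
    return [b * (xi - m) for b, xi, m in zip(coefs, x, background_means)]

# Hypothetical two-feature model; none of these numbers come from the paper.
intercept = -2.0
coefs = [1.5, -0.8]      # positive coefficient raises, negative lowers, M1 odds
means = [0.3, 0.6]       # feature means over a background data set
patient = [1.0, 0.0]     # one instance to explain

phi = linear_shap(coefs, patient, means)

# Additivity: the Shapley values sum to the difference between this
# instance's log-odds and the log-odds at the background mean.
gap = log_odds(intercept, coefs, patient) - log_odds(intercept, coefs, means)
```

In this toy case both Shapley values are positive, so both points would fall to the right of the vertical axis in a summary plot: the first feature is above its mean with a positive coefficient, and the second is below its mean with a negative coefficient.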
