Multicenter Study
World J Gastroenterol. 2025 Mar 21;31(11):102387.
doi: 10.3748/wjg.v31.i11.102387.

Construction and validation of machine learning-based predictive model for colorectal polyp recurrence one year after endoscopic mucosal resection

Yi-Heng Shi et al. World J Gastroenterol.

Abstract

Background: Colorectal polyps are precancerous lesions that can progress to colorectal cancer. Early detection and resection of colorectal polyps can effectively reduce colorectal cancer mortality. Endoscopic mucosal resection (EMR) is a common polypectomy procedure in clinical practice, but it has a high postoperative recurrence rate. Currently, there is no predictive model for the recurrence of colorectal polyps after EMR.

Aim: To construct and validate a machine learning (ML) model for predicting the risk of colorectal polyp recurrence one year after EMR.

Methods: Data were collected retrospectively from 1694 patients at three medical centers in Xuzhou. An additional 166 patients were enrolled to form a prospective validation set. Feature variables were screened using univariate and multivariate logistic regression analyses, and five ML algorithms were used to construct the predictive models. The models were compared on several performance metrics to select the best one. Decision curve analysis (DCA) and SHapley Additive exPlanation (SHAP) analysis were performed to assess clinical applicability and predictor importance.
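Decision curve analysis compares strategies by their net benefit at a chosen threshold probability. The sketch below shows the standard net-benefit calculation in plain Python. It is not the authors' code: the function names and the toy labels and probabilities are illustrative assumptions.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold.

    labels: 0/1 outcomes; probs: predicted probabilities; 0 < threshold < 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data (made up for illustration, not from the study).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.6, 0.7, 0.2, 0.1, 0.3, 0.4, 0.3, 0.05]

nb_model = net_benefit(labels, probs, 0.5)      # about 0.2 for this toy data
nb_all = net_benefit_treat_all(labels, 0.5)     # about -0.2 for this toy data
```

A decision curve plots these quantities over a range of thresholds; a model is useful at a given threshold when its net benefit exceeds both the treat-all value and zero (treat none).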

Results: Multivariate logistic regression analysis identified 8 independent risk factors for colorectal polyp recurrence one year after EMR (P < 0.05). Among the models, eXtreme Gradient Boosting (XGBoost) demonstrated the highest area under the curve (AUC) in the training set, internal validation set, and prospective validation set, with AUCs of 0.909 (95% CI: 0.89-0.92), 0.921 (95% CI: 0.90-0.94), and 0.963 (95% CI: 0.94-0.99), respectively. DCA indicated favorable clinical utility for the XGBoost model. SHAP analysis identified smoking history, family history, and age as the three most important predictors in the model.
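The AUCs above are reported with 95% confidence intervals, but the abstract does not say how those intervals were computed. The sketch below assumes a percentile bootstrap, which is one common choice, and computes the AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. All names and data here are illustrative assumptions.

```python
import random


def auc(labels, scores):
    """AUC as P(score of positive > score of negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only one class; AUC is undefined there.
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (made up for illustration, not from the study).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

point = auc(labels, scores)                 # 12 of 16 pairs are ordered correctly: 0.75
lower, upper = bootstrap_auc_ci(labels, scores)
```

The pairwise loop is quadratic in sample size, which is fine for a sketch; a rank-based formula would be preferable for datasets as large as the one described here.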

Conclusion: The XGBoost model showed the best predictive performance among the models tested and can assist clinicians in providing individualized colonoscopy follow-up recommendations.

Keywords: Colorectal polyps; Machine learning; Predictive model; Risk factors; SHapley Additive exPlanation.

Conflict of interest statement

All the authors report no relevant conflicts of interest for this article.

Figures

Figure 1
Flowchart of the study design. EMR: Endoscopic mucosal resection; LR: Logistic Regression; DT: Decision Trees; RF: Random Forest; SVM: Support Vector Machine; XGBoost: eXtreme Gradient Boosting; ROC: Receiver operating characteristic; DCA: Decision curve analysis; SHAP: SHapley Additive exPlanations.
Figure 2
Receiver operating characteristic curves of different models across various datasets. A: Training set; B: Validation set; C: Prospective set. LR: Logistic Regression; DT: Decision Trees; RF: Random Forest; SVM: Support Vector Machine; XGBoost: eXtreme Gradient Boosting; AUC: Area under the curve.
Figure 3
Decision curves of the eXtreme Gradient Boosting (XGBoost) model across various datasets. A: Training set; B: Validation set; C: Prospective set.
Figure 4
SHapley Additive exPlanations (SHAP) analysis of the XGBoost model. A: SHAP summary bar plot, with features ranked in descending order of mean absolute SHAP value; B: SHAP beeswarm plot showing the SHAP value of each feature for every sample in the dataset. Each row represents a feature, and each dot corresponds to a sample. Dot color indicates the feature value, with yellow representing high values and purple representing low values. XGBoost: eXtreme Gradient Boosting.
Figure 5
Online web calculator for predicting colorectal polyp recurrence 1 year after endoscopic mucosal resection. EMR: Endoscopic mucosal resection; XGBoost: eXtreme Gradient Boosting; CRC: Colorectal cancer.
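The Figure 4 caption describes ranking features by mean absolute SHAP value. A minimal sketch of that ranking step follows, assuming a matrix of SHAP values (one row per sample, one column per feature) has already been produced by a SHAP library. The function name, feature names, and numbers are placeholders, not study data.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    n_samples = len(shap_values)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)


# Placeholder SHAP matrix: 3 samples x 3 features (illustrative values only).
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_values = [
    [0.2, -0.5, 0.1],
    [-0.4, 0.3, 0.0],
    [0.3, -0.4, -0.2],
]

ranking = rank_by_mean_abs_shap(shap_values, feature_names)
ranked_names = [name for name, _ in ranking]
# ranked_names is ["feature_b", "feature_a", "feature_c"]
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise average toward zero and look unimportant.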
