Machine learning algorithms predict breast cancer incidence risk: a data-driven retrospective study based on biochemical biomarkers
- PMID: 40597962
- PMCID: PMC12211970
- DOI: 10.1186/s12885-025-14444-x
Abstract
Background: Current breast cancer prediction models typically rely on personal information and medical history, with limited inclusion of blood-based biomarkers. This study aimed to identify novel breast cancer risk factors using machine learning algorithms. By integrating both personal clinical factors and peripheral blood biochemical biomarkers, it sought to enhance the understanding of breast cancer risk.
Methods: Data were screened and normalized according to predefined inclusion and exclusion criteria. Logistic regression with forward selection and six other machine learning algorithms were employed to identify variables associated with breast cancer incidence. The performance of the models was evaluated using the area under the curve (AUC) through 5-fold cross-validation.
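The evaluation described in Methods (AUC estimated by 5-fold cross-validation) can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: the helper names (`auc`, `k_fold_indices`, `cross_validated_auc`, `fit_mean_difference`), the toy one-feature model, and the synthetic data are all assumptions made for the example, and the curve is assumed to be the ROC curve.

```python
import random


def auc(labels, scores):
    """ROC AUC as the probability that a random positive case scores
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(X, y, fit, k=5, seed=0):
    """Fit on k-1 folds, score the held-out fold, return one AUC per fold."""
    aucs = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(X[i]) for i in held_out]
        aucs.append(auc([y[i] for i in held_out], scores))
    return aucs


def fit_mean_difference(X_train, y_train):
    """Hypothetical toy model: score a case by its first feature, signed so
    that the score rises in the direction where training positives lie."""
    pos = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    neg = [x[0] for x, t in zip(X_train, y_train) if t == 0]
    direction = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return lambda row: direction * row[0]


if __name__ == "__main__":
    # Synthetic data for illustration only; these are not study data.
    rng = random.Random(1)
    X = [[rng.gauss(0, 1)] for _ in range(200)]
    y = [1 if row[0] + rng.gauss(0, 1) > 0 else 0 for row in X]
    fold_aucs = cross_validated_auc(X, y, fit_mean_difference, k=5)
    print("per-fold AUC:", [round(a, 3) for a in fold_aucs])
    print("mean AUC:", round(sum(fold_aucs) / len(fold_aucs), 3))
```

Reporting the per-fold values alongside the mean shows how much the estimate varies between held-out folds, which a single pooled AUC would hide.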
Results: The data were divided into a training cohort of 17,360 cases and a testing cohort of 8,551 cases. Logistic regression analysis revealed that breast cancer incidence increased with age (odds ratio [OR]: 1.136, 95% confidence interval [CI]: [1.130, 1.142], P < 0.001), gamma-glutamyl transferase (GGT) (OR: 1.002, 95% CI: [1.000, 1.004], P = 0.014), and alanine transaminase (ALT) (OR: 1.005, 95% CI: [1.001, 1.008], P = 0.008). Furthermore, the six machine learning algorithms consistently identified GGT and ALT as the most significant predictive features. The AUC values obtained from the six models after 5-fold cross-validation ranged from 0.779 to 0.862, with accuracy ranging from 0.780 to 0.841.
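Because logistic regression odds ratios are reported per one-unit increase in the predictor, values close to 1 (as for GGT and ALT) can still add up over larger differences. The sketch below rescales the reported ORs and their confidence limits to a larger increment using OR^delta. The function name `scale_odds_ratio` and the 10-unit increment are illustrative assumptions, and the abstract does not state the measurement units, so "unit" here means whatever unit the authors' model used.

```python
import math


def scale_odds_ratio(or_per_unit, delta):
    """Odds ratio for a change of `delta` units, given the per-unit OR.

    In logistic regression the coefficient is beta = ln(OR), and a change
    of `delta` units multiplies the odds by exp(beta * delta) = OR ** delta.
    """
    return math.exp(math.log(or_per_unit) * delta)


# (OR, lower 95% CI, upper 95% CI) as reported in the abstract.
reported = {
    "age": (1.136, 1.130, 1.142),
    "GGT": (1.002, 1.000, 1.004),
    "ALT": (1.005, 1.001, 1.008),
}

if __name__ == "__main__":
    delta = 10  # hypothetical increment, chosen only for illustration
    for name, (or_, lo, hi) in reported.items():
        scaled = [scale_odds_ratio(v, delta) for v in (or_, lo, hi)]
        print(f"{name}: OR per {delta} units = {scaled[0]:.3f} "
              f"(95% CI {scaled[1]:.3f} to {scaled[2]:.3f})")
```

The confidence limits can be rescaled the same way because the transformation is monotone. This assumes the effect is linear on the log-odds scale across the whole increment, which the abstract does not confirm.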
Conclusions: Our study identified two biochemical biomarkers (GGT and ALT) as promising indicators for breast cancer prediction. Future research should incorporate these findings into a tailored breast cancer risk prediction model.
Keywords: Biochemical biomarkers; Breast cancer; Machine learning; Risk factor; Validation.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: This single-center retrospective study received approval from the Ethics Committee of Guangdong Provincial Hospital of Chinese Medicine (GPHCM) on 24 March 2023 (approval no. ZE2023-067–01). All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the Helsinki Declaration. Informed consent was waived as approved by the Ethics Committee because of the retrospective nature of the study. Consent for publication: Not applicable. Competing interests: The authors declare that they have no competing interests.
Grants and funding
- 82274513, 82405608, and 82474504/National Natural Science Foundation of China
- NO.2024A03J0850/Guangzhou Science and Technology Plan Project
- YN2022QN01/Guangdong Hospital of Traditional Chinese Medicine Special Research Project on Traditional Chinese Medicine Science and Technology