BMC Cancer. 2025 Jul 1;25(1):1061.
doi: 10.1186/s12885-025-14444-x.

Machine learning algorithms predict breast cancer incidence risk: a data-driven retrospective study based on biochemical biomarkers

Qianqian Guo et al. BMC Cancer.

Abstract

Background: Current breast cancer prediction models typically rely on personal information and medical history, with limited inclusion of blood-based biomarkers. This study aimed to identify novel breast cancer risk factors using machine learning algorithms. By integrating both personal clinical factors and peripheral blood biochemical biomarkers, it sought to enhance the understanding of breast cancer risk.

Methods: Data were screened and normalized according to predefined inclusion and exclusion criteria. Logistic regression with forward selection and six other machine learning algorithms were employed to identify variables associated with breast cancer incidence. The performance of the models was evaluated using the area under the curve (AUC) through 5-fold cross-validation.
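The evaluation step described in the Methods can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the stratified fold assignment, the `fit_mean_difference` stand-in model, and all function names are assumptions introduced here, and the abstract does not say which software or splitting scheme was used.

```python
import random
from statistics import mean


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive case outscores a
    random negative case, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, k=5, seed=0):
    """Split indices into k folds, dealing each class out round-robin.
    Stratification is an assumption; the abstract does not specify it."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for rank, i in enumerate(members):
            folds[rank % k].append(i)
    return folds


def fit_mean_difference(X, y):
    """Hypothetical stand-in model: weight each feature by the gap
    between its class means, then score by the weighted sum."""
    def class_mean(cls, j):
        return mean(row[j] for row, label in zip(X, y) if label == cls)
    weights = [class_mean(1, j) - class_mean(0, j) for j in range(len(X[0]))]
    return lambda row: sum(w * v for w, v in zip(weights, row))


def cross_validated_auc(X, y, fit=fit_mean_difference, k=5, seed=0):
    """Fit on k-1 folds, score the held-out fold, return per-fold AUCs."""
    fold_aucs = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        score = fit([X[i] for i in train], [y[i] for i in train])
        fold_aucs.append(auc([y[i] for i in held_out],
                             [score(X[i]) for i in held_out]))
    return fold_aucs
```

Any classifier that returns a scoring function can be passed as `fit`; averaging the returned list gives a single cross-validated AUC, and the spread across folds indicates how stable the estimate is.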

Results: The data were divided into a training cohort of 17,360 cases and a testing cohort of 8,551 cases. Logistic regression analysis showed that breast cancer incidence increased with age (odds ratio [OR]: 1.136, 95% confidence interval [CI]: [1.130, 1.142], P < 0.001), gamma-glutamyl transferase (GGT) (OR: 1.002, 95% CI: [1.000, 1.004], P = 0.014), and alanine transaminase (ALT) (OR: 1.005, 95% CI: [1.001, 1.008], P = 0.008). Furthermore, the six machine learning algorithms consistently identified GGT and ALT as the most significant predictive features. After 5-fold cross-validation, the AUC values of the six models ranged from 0.779 to 0.862, and accuracy ranged from 0.780 to 0.841.
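Because the reported odds ratios are per one unit of each predictor, values close to 1 can still correspond to a sizeable effect over a wider range. The sketch below rescales the quoted estimates to a 10-unit increase. It assumes the usual logistic regression property that log-odds are linear in the predictor; the units themselves (for example, years for age) are not stated in the abstract, and the function names are illustrative.

```python
import math


def scaled_odds_ratio(or_per_unit, delta):
    """Odds ratio for a `delta`-unit increase, given the per-unit OR.
    Log-odds are linear in the predictor, so OR(delta) = OR ** delta."""
    return math.exp(delta * math.log(or_per_unit))


def scaled_interval(ci_low, ci_high, delta):
    """Apply the same rescaling to both confidence limits (delta > 0)."""
    return scaled_odds_ratio(ci_low, delta), scaled_odds_ratio(ci_high, delta)


# Per-unit estimates quoted in the abstract: (OR, CI low, CI high).
reported = {
    "age": (1.136, 1.130, 1.142),
    "GGT": (1.002, 1.000, 1.004),
    "ALT": (1.005, 1.001, 1.008),
}

for name, (or_, low, high) in reported.items():
    ten = scaled_odds_ratio(or_, 10)
    lo10, hi10 = scaled_interval(low, high, 10)
    print(f"{name}: OR per 10 units = {ten:.3f} [{lo10:.3f}, {hi10:.3f}]")
```

Under these assumptions, the age estimate of 1.136 per unit compounds to roughly 3.6 over ten units, while the GGT and ALT estimates remain close to 1 over the same span.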

Conclusions: Our study identified two biochemical biomarkers (GGT and ALT) as promising indicators for breast cancer prediction. Future research should incorporate these findings into a tailored breast cancer risk prediction model.

Keywords: Biochemical biomarkers; Breast cancer; Machine learning; Risk factor; Validation.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: This single-center retrospective study received approval from the Ethics Committee of Guangdong Provincial Hospital of Chinese Medicine (GPHCM) on 24 March 2023 (approval no. ZE2023-067–01). All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the Helsinki Declaration. Informed consent was waived as approved by the Ethics Committee because of the retrospective nature of the study. Consent for publication: Not applicable. Competing interests: The authors declare that they have no competing interests.

Figures

Fig. 1
GGT and ALT levels in patients with and without breast cancer. **P < 0.0001. Abbreviations: GGT, gamma-glutamyl transferase; ALT, alanine transaminase; BC, breast cancer; Non BC, non-breast cancer
Fig. 2
The logistic regression model predicting breast cancer (training cohort). A: age, menopause, previously diagnosed with atypical hyperplasia; B: age, menopause, previously diagnosed with atypical hyperplasia, GGT, ALT; C: age, menopause, previously diagnosed with atypical hyperplasia, GGT, ALT, ALB. Abbreviations: ROC, receiver operating characteristics; AUC, area under the curve; GGT, gamma-glutamyl transferase; ALT, alanine transaminase; ALB, albumin
Fig. 3
Heatmap of Pearson correlation coefficients for continuous features
Fig. 4
The ROC curves for the testing cohort of the machine learning model. Abbreviations: ROC, receiver operating characteristics; AUC, area under the curve
Fig. 5
The calibration curves for the testing cohort of the machine learning model
