Sci Rep. 2025 Jan 29;15(1):3728.
doi: 10.1038/s41598-025-87622-3.

Leveraging survival analysis and machine learning for accurate prediction of breast cancer recurrence and metastasis

Shahd M Noman et al. Sci Rep.

Abstract

Breast cancer, with its high incidence and mortality globally, necessitates early prediction of local and distant recurrence to improve treatment outcomes. This study develops and validates predictive models for breast cancer recurrence and metastasis using Recurrence-Free Survival Analysis and machine learning techniques. We merged datasets from the Molecular Taxonomy of Breast Cancer International Consortium, Memorial Sloan Kettering Cancer Center, Duke University, and the SEER program, creating a comprehensive dataset of 272,252 rows and 23 columns. Our methodology utilized three predictive strategies: assessing recurrence risk, differentiating local from distant recurrences, and identifying potential metastatic sites. Key prognostic factors were identified through survival analysis. LightGBM, XGBoost, and Random Forest models were employed and validated against data from the Baheya Foundation. The models demonstrated strong performance; the survival analysis achieved a C-index of 0.837. The LightGBM model reached an AUC of 92% in predicting recurrences, while XGBoost and Random Forest models distinguished recurrence types with up to 86% accuracy, and they effectively differentiated between bone metastasis and all other locations combined (brain, liver, and lungs). This study highlights the significant potential of machine learning in advancing breast cancer management and sets a new benchmark for predictive analytics. Future research will integrate genetic data to further enhance these models.

Keywords: Breast cancer; Machine learning; Metastasis; Recurrence prediction; Survival analysis.
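
The C-index reported in the abstract measures how well predicted risk orders patients by time to recurrence. The sketch below is a minimal, illustrative implementation of Harrell's concordance index in plain Python; it is not the authors' code, and the function name `concordance_index`, its arguments, and the toy data are assumptions made for this example. Pairs with tied follow-up times are skipped for simplicity.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    times       -- observed follow-up time per patient
    events      -- 1 if recurrence was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    (All names here are hypothetical, chosen for illustration.)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        # a is the patient with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier recurrence had the higher risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (invented for illustration, not from the study)
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.6, 0.7, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.837 should be read.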

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Recurrence-free survival analysis. (A) Kaplan-Meier RFS curve. Kaplan-Meier survival plots adjusted for study covariates: (B) Molecular Subtype, (C) HER2, (D) Lymph Node Status, (E) Tumor Size, (F) Tumor Grade. (G) Nomogram for predicting RFS.
Fig. 2
Recurrence vs. non-recurrence approach: results on the validation set. (A) shows differences in training and testing accuracies of 6 machine learning models. (B) reveals the performance of the evaluation metrics: Accuracy, Recall, and F1 score. (C) combines the confusion matrices of all 6 models. The confusion matrix displays the predicted classes on the X-axis and the true classes on the Y-axis, with the color of the diagonal blocks illustrating the closeness of the match between the predicted and the true class. The darker the blue color of the diagonal line, the better the model prediction accuracy. (D) is a combined ROC curve for all 6 models. (E) visualizes statistical p-value results with cross-validation of each model.
Fig. 3
Local vs. distant recurrence results on the validation set. (A) shows slight differences in training and testing accuracies of 6 machine learning models. (B) reveals the performance of the evaluation metrics: Accuracy, Recall, and F1 score. (C) combines the confusion matrices of all 6 models. The confusion matrix displays the predicted classes on the X-axis and the true classes on the Y-axis, with the color of the diagonal blocks illustrating the closeness of the match between the predicted and true class. The darker the blue color of the diagonal line, the better the model prediction accuracy. (D) is a combined ROC curve for all 6 models.
Fig. 4
Distant sites prediction results on the test set. (A) Multi-class classification SVM confusion matrix showing accuracy distribution across multiple classes. (B) ROC curve of the SVM model, displaying the AUC values for bone, lung, liver, and brain metastasis predictions. (C) Confusion matrix for binary classification distinguishing between 'Bone' and 'Other' locations. (D) ROC for the binary classification model, highlighting its performance in differentiating between the two categories.
Fig. 5
Feature distribution across metastatic locations. (A) PCA distribution of metastatic cancer data showcasing the variance explained by the first principal component (PC1) across different metastasis types (Bone, Brain, Liver, Lung). (B) Density plot illustrating the overlapping clusters between different locations. (C) Histograms showing the distribution of tumor characteristics and treatment modalities across different subtypes and responses, such as hormone receptor status (ER, HER2 ±), tumor size (T1, T2, T3, T4), molecular subtype (Luminal A, Luminal B), and treatments received (Chemotherapy, Radiotherapy). (D) SHAP feature importance.
Fig. 6
Workflow diagram of the comprehensive methodology from data collection to validation. This diagram outlines the process starting with the aggregation of training data from multiple sources, followed by detailed preprocessing steps. The methodology incorporates survival analysis and diverse machine learning techniques for outcome prediction, with subsequent hyperparameter tuning and model evaluation. The final phase involves external validation with population data to ensure model robustness.
Fig. 7
Overview of data preprocessing and integration pipeline: datasets from Metabric, MSK, Duke, SEER, and the Baheya Foundation are preprocessed, merged, and feature-selected to form comprehensive datasets for recurrence type, recurrence status, and distant site prediction approaches.
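
The Kaplan-Meier RFS curves described for Fig. 1 rest on the product-limit estimator, which multiplies the conditional probabilities of remaining recurrence-free at each observed event time. The following is a minimal sketch of that estimator in plain Python, under the assumption of right-censored data; the function name `kaplan_meier` and the toy inputs are hypothetical and do not come from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- follow-up time per patient
    events -- 1 if the event (recurrence) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    (Hypothetical helper written for illustration only.)
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Conditional probability of surviving past t, given at risk at t
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set
        n_at_risk -= n_leaving
    return curve


# Toy data (invented for illustration, not from the study)
km_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1])
```

Censored patients contribute to the risk set up to their last follow-up but never lower the curve themselves, which is why the estimate only steps down at times where a recurrence was observed.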
