Healthcare (Basel). 2025 Jun 15;13(12):1432.
doi: 10.3390/healthcare13121432.

Understanding Cancer Risk Among Bangladeshi Women: An Explainable Machine Learning Approach to Socio-Reproductive Factors Using Tertiary Hospital Data

Muhammad Rafiqul Islam et al. Healthcare (Basel). 2025.

Abstract

Background: Breast cancer poses a significant health challenge in Bangladesh, where limited screening and unique reproductive patterns contribute to delayed diagnoses and subtype-specific disparities. While reproductive risk factors such as age at menarche, parity, and contraceptive use are well studied in high-income countries, their associations with hormone-receptor-positive (HR+) and triple-negative breast cancer (TNBC) remain underexplored in low-resource settings.

Methods: A case-control study was conducted at the National Institute of Cancer Research and Hospital (NICRH), including 486 histopathologically confirmed breast cancer cases (246 HR+, 240 TNBC) and 443 cancer-free controls. Socio-demographic and reproductive data were collected through structured interviews. Machine learning models, including Logistic Regression, Lasso, Support Vector Machines, Random Forest, and XGBoost, were trained using stratified five-fold cross-validation. Model performance was evaluated using sensitivity, F1-score, and area under the receiver operating characteristic curve (AUROC). To interpret model predictions and quantify the contribution of individual features, we employed SHapley Additive exPlanations (SHAP) values.
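The evaluation protocol described in the Methods can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the fold assignment is a simple round-robin deal within each class, and the binary task (HR+ cases versus controls, using the counts reported above) is an assumption made only to have concrete numbers.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds


def sensitivity_and_f1(y_true, y_pred):
    """Sensitivity (recall) and F1-score for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return sens, f1


def auroc(y_true, scores):
    """AUROC as P(random positive scores above random negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed illustrative task: 246 HR+ cases (label 1) versus 443 controls (label 0).
labels = [1] * 246 + [0] * 443
folds = stratified_folds(labels, k=5)
for fold_id, fold in enumerate(folds):
    n_cases = sum(labels[i] for i in fold)
    print(fold_id, len(fold), n_cases)
```

Each fold would serve once as the held-out set while a model is fitted on the other four; the per-fold sensitivity, F1-score, and AUROC are then averaged. The pairwise AUROC above is quadratic in sample size, which is acceptable at this scale but not for large datasets.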

Results: XGBoost achieved the highest overall performance (F1-score = 0.750), and SHAP-based interpretability revealed key predictors for each subtype. Rural residence, low education (≤5 years), and undernutrition were significant predictors across subtypes. Cesarean delivery and multiple abortions were more predictive of TNBC, while urban residence, employment, and higher education were more predictive of HR+. Age at menarche and age at first childbirth showed decreasing predictive importance with increasing age for HR+, while larger gaps between marriage and childbirth were more predictive of TNBC.

Conclusions: Our findings underscore the value of machine learning coupled with SHAP-based explainability in identifying context-specific risk factors for breast cancer subtypes in resource-limited settings. This approach enhances transparency and supports the development of targeted public health interventions to reduce breast cancer disparities in Bangladesh.

Keywords: breast cancer risk; explainable machine learning; women reproductive risk factors.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Model performance comparison for predicting breast cancer subtypes. (A,B) Tables presenting performance metrics for predicting HR+ and TNBC subtypes across various machine learning models, including Logistic Regression (LR), Logistic Regression with Lasso Regularization (LR with Lasso), Support Vector Machine (SVM), Random Forest (RF), and XGBoost. Metrics include sensitivity (ST), specificity (SP), precision (PR), recall (RC), F1 score (F1), and balanced accuracy (BAC). (C,D) Receiver Operating Characteristic (ROC) curves with Area Under the Curve (AUC) scores for HR+ (C) and TNBC (D) predictions, illustrating model performance in distinguishing breast cancer subtypes. XGBoost consistently demonstrates robust performance, particularly for HR+ cases.
Figure 2
Comparison of SHAP values for HR+ and TNBC cases among Bangladeshi women. The plot illustrates the global mean SHAP values (dots) with 95% confidence intervals (horizontal lines) for each feature, providing insights into their contribution to the prediction of breast cancer subtypes. Red represents HR+ cases and blue represents TNBC cases, with bold vertical lines denoting significance (p-value < 0.05) based on Wilcoxon signed-rank tests. Features with significant SHAP values are highlighted with distinct diamond markers. Feature names reflect reproductive and socio-demographic factors critical to breast cancer risk in the study population.
Figure 3
Relative contributions of features to the prediction of triple-negative (TN) and HR+ breast cancer cases among Bangladeshi women. The x-axis represents the three tested hypotheses using Wilcoxon paired comparisons: μ_TN > μ_HR+ (greater importance for TN cases), μ_TN = μ_HR+ (no difference in importance), and μ_TN < μ_HR+ (greater importance for HR+ cases). The y-axis displays the features, with their labels positioned relative to their contribution. Blue dots indicate features that are more important for predicting triple-negative cases, while red dots represent features that are more important for hormone-receptor-positive cases. The central black dots are features for which the average importance is not significantly different between TN and HR+ cases, suggesting equal relevance for both subtypes.
Figure 4
SHAP values for numeric features across breast cancer subtypes. (A) SHAP values for HR+ cases, illustrating the impact of numeric features, such as age, age at first baby, age at first marriage, age at menarche, gap between first marriage and first baby, and gap between menarche and first baby, on model predictions. (B) SHAP values for TNBC cases, highlighting the same numeric features. The scatter points represent individual SHAP values for each data instance, while the red trend lines represent the smoothed averages (with standard deviations shaded) across feature values, showing the overall contribution trends of each feature. A decreasing trend in importance for age at menarche in HR+ cases is observed, indicating reduced predictive importance with increasing age. In contrast, gaps between first marriage and first baby show increasing predictive relevance for TNBC cases as the gap widens.
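
The per-instance SHAP values plotted in Figures 2 to 4 are Shapley attributions: each feature receives its average marginal contribution to the prediction over all feature coalitions. The sketch below computes them exactly by brute force for a hypothetical three-feature toy score. It is not the study's model, and the abstract does not say which SHAP explainer the authors used; practical SHAP tooling relies on faster, model-specific or approximate algorithms rather than this exponential enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) per feature.

    Features outside a coalition are set to their baseline value. The cost is
    exponential in the number of features, so this is for illustration only.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_score(features):
    """Hypothetical risk score with one interaction term (assumed, not from the paper)."""
    rural, low_education, undernourished = features
    return 0.2 * rural + 0.1 * low_education + 0.3 * rural * undernourished


phi = shapley_values(toy_score, x=[1, 1, 1], baseline=[0, 0, 0])
print(phi, sum(phi))
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features involved. Averaging such per-instance values across cases gives global summaries of the kind shown in Figure 2.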
