Sci Rep. 2025 Jul 2;15(1):22971.
doi: 10.1038/s41598-025-06426-7.

A novel double machine learning approach for detecting early breast cancer using advanced feature selection and dimensionality reduction techniques

Suganya Athisayamani et al. Sci Rep. 2025.

Abstract

This paper proposes three Double Machine Learning (DML) models to improve the accuracy of breast cancer detection on the breast cancer detection dataset. Each DML model learns primary features with a machine learning model and a deep learning model, then fuses those features in a meta-classifier to obtain the final classification. The first DML model combines the interpretability of Random Forest (RF) with the deep learning capabilities of a Feedforward Neural Network (FNN). RF processes structured features and provides class probabilities and feature importance scores, while the FNN learns non-linear relationships and generates embeddings. These outputs are fused into a combined feature vector, which the meta-classifier uses for final predictions. This approach captures both structured features and non-linear patterns, making it suitable for datasets with complex dependencies. The second model pairs eXtreme Gradient Boosting (XGBoost), an efficient boosting algorithm for tabular data, with an Artificial Neural Network (ANN). XGBoost optimizes decision tree ensembles and provides class probabilities, while the ANN processes numerical data to learn deeper representations. A meta-classifier then uses the fused outputs of XGBoost and the ANN for final predictions. This model is particularly effective for datasets that combine structured features (handled by XGBoost) with numerical features (handled by the ANN). The third model integrates LightGBM, a fast and scalable gradient-boosting framework, with an ANN, which is well suited to analyzing sequential data. LightGBM processes structured features to provide probabilities and importance scores, while the ANN learns temporal dependencies from sequential data. The outputs of LightGBM and the ANN are concatenated and passed to a meta-classifier for decision-making. This model is intended for datasets with both static features (LightGBM) and continuous data (ANN), such as time-series datasets or datasets with sequential dependencies. Combined with dimensionality reduction (PCA) and feature selection, these DML models significantly improve the performance of breast cancer detection systems by leveraging both structured and sequential data, reaching an accuracy of 0.99.

Keywords: Breast Cancer; Decision Tree; Double machine learning; Feature Selection; KNN; Machine Learning; PCA; Random Forest; SVM.
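
The fusion step described in the abstract for the first model (RF plus FNN feeding a meta-classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It makes several assumptions: scikit-learn's bundled `load_breast_cancer` data stands in for the Kaggle dataset, `MLPClassifier` stands in for the FNN, logistic regression stands in for the meta-classifier, only class probabilities are fused (the abstract also mentions feature importance scores and embeddings), and the helper name `out_of_fold_probabilities` and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled 30-feature dataset is only a stand-in for the paper's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base learners: a tree ensemble and a small feedforward network (hypothetical sizes).
rf = RandomForestClassifier(n_estimators=200, random_state=0)
fnn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)


def out_of_fold_probabilities(models, X, y, folds=5):
    """Concatenate each model's out-of-fold class probabilities column-wise.

    Out-of-fold predictions keep the meta-classifier from training on
    probabilities that a base model produced for its own training rows.
    """
    blocks = [
        cross_val_predict(model, X, y, cv=folds, method="predict_proba")
        for model in models
    ]
    return np.hstack(blocks)


# Fused feature vector for training: [RF probabilities | FNN probabilities].
meta_train = out_of_fold_probabilities([rf, fnn], X_train, y_train)

# Refit the base learners on the full training split, then fuse their test outputs.
rf.fit(X_train, y_train)
fnn.fit(X_train, y_train)
meta_test = np.hstack([rf.predict_proba(X_test), fnn.predict_proba(X_test)])

# Meta-classifier makes the final prediction from the fused vector.
meta = LogisticRegression(max_iter=1000).fit(meta_train, y_train)
accuracy = meta.score(meta_test, y_test)
print(f"meta-classifier test accuracy: {accuracy:.3f}")
```

Swapping `RandomForestClassifier` for a gradient-boosting classifier would give the same shape of pipeline for the second and third models; the accuracy printed here depends on the stand-in data and settings and should not be compared with the paper's reported 0.99.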

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Heatmap between columns of the dataset.
Fig. 2
The correlation coefficients of the target value and top five features.
Algorithm 1
Principal component analysis
Fig. 3
Proposed classification framework.
Fig. 4
Proposed ANN model.
Algorithm 2
Random Forest + Feedforward Neural Network (FNN)
Algorithm 3
XGBoost + Artificial Neural Network (ANN)
Algorithm 4
LightGBM + Artificial Neural Network (ANN)
Fig. 5
ROC with Decision tree using all 30 features.
Fig. 6
ROC with ANN using all 30 features.
Fig. 7
(a) Accuracy and (b) loss on training and validation with all 30 features using ANN.
Fig. 8
Statistical distribution of performance metrics with all features.
Fig. 9
(a) Accuracy and (b) loss on training and validation with top 5 features using ANN.
Fig. 10
Statistical distribution of performance metrics with top 5 features.
Fig. 11
(a) Accuracy and (b) loss on training and validation with 16 principal components using ANN.
Fig. 12
Statistical distribution of performance metrics with 16 principal components.
Fig. 13
(a) Accuracy and (b) loss on training and validation with 8 principal components using ANN.
Fig. 14
Statistical distribution of performance metrics with 8 principal components.
Fig. 15
(a) Accuracy and (b) loss on training and validation with 4 principal components using ANN.
Fig. 16
Statistical distribution of performance metrics with 4 principal components.
Fig. 17
(a) Accuracy and (b) loss on training and validation with 2 principal components using ANN.
Fig. 18
Statistical distribution of performance metrics with 2 principal components.
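
The figure captions above refer to experiments on all 30 features, on the top five features by correlation with the target, and on 16, 8, 4, and 2 principal components. A minimal sketch of how such reduced inputs might be prepared is shown below. It is an assumption-laden illustration rather than the paper's procedure: scikit-learn's bundled `load_breast_cancer` data is a stand-in for the Kaggle dataset, ranking by absolute Pearson correlation is one plausible reading of the caption of Fig. 2, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled 30-feature dataset used as a stand-in for the paper's data.
data = load_breast_cancer()
X, y = data.data, data.target

# Feature selection: rank features by absolute Pearson correlation with the target.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top5_idx = np.argsort(np.abs(corr))[::-1][:5]
X_top5 = X[:, top5_idx]
print("top 5 features:", [str(data.feature_names[j]) for j in top5_idx])

# Dimensionality reduction: standardize, then project onto k principal components.
# In a real evaluation the scaler and PCA would be fit on the training split only.
X_std = StandardScaler().fit_transform(X)
reduced = {}
explained = {}
for k in (16, 8, 4, 2):
    pca = PCA(n_components=k).fit(X_std)
    reduced[k] = pca.transform(X_std)
    explained[k] = float(pca.explained_variance_ratio_.sum())
    print(f"{k:2d} components retain {explained[k]:.3f} of the variance")
```

Each entry of `reduced`, or `X_top5`, can then be passed to a classifier in place of the full feature matrix to compare how much accuracy is kept as the input shrinks.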
