Int J Environ Res Public Health. 2021 Oct 12;18(20):10670.
doi: 10.3390/ijerph182010670.

Evaluation of Feature Selection Techniques for Breast Cancer Risk Prediction

Nahúm Cueto López et al. Int J Environ Res Public Health.

Abstract

This study evaluates several feature ranking techniques, together with several machine learning classifiers, to identify factors relevant to the probability of developing breast cancer and to improve the performance of breast cancer risk prediction models in a healthy population. The dataset, with 919 cases and 946 controls, comes from the MCC-Spain study and includes only environmental and genetic features. Breast cancer is a major public health problem. Our aim is to analyze which factors in the risk prediction model are the most important for breast cancer prediction. Likewise, quantifying the stability of feature selection methods is essential before trying to gain insight into the data. This paper assesses several feature selection algorithms in terms of the performance of a set of predictive models. Furthermore, their robustness is quantified by analyzing both the similarity between the feature selection rankings and the stability of each ranking. The ranking provided by the SVM-RFE approach leads to the best performance in terms of the area under the ROC curve (AUC). The top 47 ranked features obtained with this approach, fed to the Logistic Regression classifier, achieve an AUC of 0.616, an improvement of 5.8% over the full feature set. Furthermore, the SVM-RFE ranking technique turned out to be highly stable (as did Random Forest), whereas Relief and the wrapper approaches are quite unstable. This study demonstrates that stability and model performance should be studied together: Random Forest and SVM-RFE were the most stable algorithms, but SVM-RFE outperforms Random Forest in terms of model performance.
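The AUC used above to compare rankings can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The following is a minimal sketch of that pairwise definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case, 0 for a control.
    scores: predicted risk, higher meaning more likely to be a case.
    Tied pairs count as half a correct ranking.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    correct = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                correct += 1.0
            elif case_score == control_score:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Hypothetical toy data: three cases followed by three controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation of cases from controls. The pairwise loop is quadratic in the sample size, which is fine for illustration; a rank-based implementation would be preferable for large datasets.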

Keywords: breast cancer; feature selection; risk prediction model; stability.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. Area under the curve using the complete data set without reducing features and different cardinality of the subset of features for different classifiers: AdaBoost.
Figure 2. Area under the curve using the complete data set without reducing features and different cardinality of the subset of features for different classifiers: k-NN.
Figure 3. Area under the curve using the complete data set without reducing features and different cardinality of the subset of features for different classifiers: Logistic Regression.
Figure 4. Area under the curve using the complete data set without reducing features and different cardinality of the subset of features for different classifiers: Multilayer Perceptron.
Figure 5. Area under the curve using the complete data set without reducing features and different cardinality of the subset of features for different classifiers: SVM.
Figure 6. Feature selector stability: (a) Jaccard index for feature subsets with different cardinality; (b) MDS plot of the feature ranking algorithms.
Figure 7. Performance for pre-menopausal data partition (by classifier): (a) BT, (b) k-NN, (c) LR, (d) MLP, (e) SVM.
Figure 8. Performance for postmenopausal data partition (by classifier): (a) BT, (b) k-NN, (c) LR, (d) MLP, (e) SVM.
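
The Jaccard index named in Figure 6(a) measures how much two feature subsets overlap: the size of their intersection divided by the size of their union. One common way to turn it into a stability score, sketched below, is to take the top-k features from several rankings (for example, rankings produced on different resamples of the data) and average the index over all pairs. Whether the study aggregates the index in exactly this way is an assumption here, and the feature names and rankings are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_k_stability(rankings: Sequence[Sequence[str]], k: int) -> float:
    """Mean pairwise Jaccard index of the top-k subsets of several rankings."""
    subsets = [set(ranking[:k]) for ranking in rankings]
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two rankings to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rankings of five made-up features from three resamples.
rankings = [
    ["age", "bmi", "alcohol", "snp_1", "smoking"],
    ["age", "alcohol", "bmi", "smoking", "snp_1"],
    ["bmi", "age", "snp_1", "alcohol", "smoking"],
]
stability_k3 = top_k_stability(rankings, 3)
print(f"stability at k=3: {stability_k3:.3f}")
```

Evaluating `top_k_stability` over a range of k gives a curve of stability against subset cardinality. When k equals the total number of features, every subset is the full feature set and the score is trivially 1.0, so only smaller cardinalities are informative.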

