A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification

Moumita Mandal et al. Sensors (Basel). 2021 Aug 18;21(16):5571. doi: 10.3390/s21165571.

Abstract

In machine learning and data science, feature selection is considered a crucial step of data preprocessing. When raw data are applied directly for classification or clustering, learning algorithms sometimes perform poorly. One possible reason is the presence of redundant, noisy, and non-informative features or attributes in the datasets. Hence, feature selection methods are used to identify the subset of relevant features that maximizes model performance. Moreover, reducing the feature dimension also reduces both the training time and the storage required by the model. In this paper, we present a tri-stage wrapper-filter feature selection framework for medical report-based disease detection. In the first stage, an ensemble was formed from four filter methods (Mutual Information, ReliefF, Chi-square, and Xvariance); each feature from the union set was then assessed by three classification algorithms (support vector machine, naïve Bayes, and k-nearest neighbors), and an average accuracy was calculated. The features with higher accuracy were selected to obtain a preliminary subset of optimal features. In the second stage, Pearson correlation was used to discard highly correlated features. In both of these stages, the XGBoost classification algorithm was applied to identify the most contributing features, which in turn yield the best optimal subset. In the final stage, we fed the obtained feature subset to a meta-heuristic algorithm, the whale optimization algorithm, in order to further reduce the feature set and achieve higher accuracy. We evaluated the proposed feature selection framework on four publicly available disease datasets from the UCI machine learning repository: arrhythmia, leukemia, DLBCL, and prostate cancer. Our results confirm that the proposed method can outperform many state-of-the-art methods and can detect important features as well. Fewer features mean fewer medical tests for a correct diagnosis, thus saving both time and cost.
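The second stage of the pipeline, discarding highly correlated features via Pearson correlation, can be sketched as follows. This is a minimal illustration, not the authors' code: the 0.9 threshold and the greedy keep-first-seen strategy are assumptions made for the example.

```python
import numpy as np

def drop_correlated_features(X, threshold=0.9):
    """Greedy Pearson-correlation filter (illustrative sketch).

    Scans features left to right; a feature is kept only if its absolute
    Pearson correlation with every already-kept feature is <= threshold.
    Returns the column indices of the retained features.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature |r| matrix
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Tiny demo: feature 1 is (almost) a linear copy of feature 0,
# feature 2 is independent noise, so features 0 and 2 survive.
rng = np.random.default_rng(0)
f0 = rng.normal(size=50)
X = np.column_stack([f0,
                     2.0 * f0 + 0.01 * rng.normal(size=50),
                     rng.normal(size=50)])
print(drop_correlated_features(X, threshold=0.9))  # -> [0, 2]
```

In the paper this step operates on the preliminary subset produced by the filter ensemble of stage one, with XGBoost used to judge which of the surviving features contribute most.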

Keywords: arrhythmia; cancer dataset; disease classification; feature selection; filter method; whale optimization algorithm; wrapper method.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. Flowchart of our proposed tri-phase hybrid wrapper-filter feature selection method.

Figure 2. Comparison of accuracies obtained on four disease datasets using KNN, SVM, and NB classifiers without any feature selection.

Figure 3. Comparison of accuracies, number of features, and computational time obtained on four disease datasets using our proposed tri-stage wrapper-filter feature selection method.

Figure 4. Comparison of the number of features obtained by each phase of our proposed tri-stage wrapper-filter feature selection method for all four disease datasets, considering the highest accuracy achieved in each phase.

