A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification
- PMID: 34451013
- PMCID: PMC8402295
- DOI: 10.3390/s21165571
Abstract
In machine learning and data science, feature selection is considered a crucial step of data preprocessing. When raw data are applied directly for classification or clustering, the learning algorithms sometimes perform poorly. One possible reason is the presence of redundant, noisy, and non-informative features or attributes in the datasets. Feature selection methods are therefore used to identify the subset of relevant features that maximizes model performance. Reducing the feature dimension also reduces both the training time and the storage required by the model. In this paper, we present a tri-stage wrapper-filter-based feature selection framework for medical report-based disease detection. In the first stage, an ensemble was formed from four filter methods (Mutual Information, ReliefF, Chi Square, and Xvariance), and each feature from the union set was then assessed by three classification algorithms (support vector machine, naïve Bayes, and k-nearest neighbors), with the average accuracy calculated across them. The features with higher accuracy were selected to obtain a preliminary subset of optimal features. In the second stage, Pearson correlation was used to discard highly correlated features. In both of these stages, the XGBoost classification algorithm was applied to identify the most contributing features, which in turn provide the optimal subset. In the final stage, we fed the obtained feature subset to a meta-heuristic algorithm, the whale optimization algorithm, to further reduce the feature set and achieve higher accuracy. We evaluated the proposed feature selection framework on four publicly available disease datasets taken from the UCI machine learning repository, namely, arrhythmia, leukemia, DLBCL, and prostate cancer. Our results confirm that the proposed method can perform better than many state-of-the-art methods and can also detect important features. Fewer features mean fewer medical tests are needed for a correct diagnosis, thus saving both time and cost.
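The second stage, discarding highly correlated features with Pearson correlation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `drop_correlated`, the cutoff of 0.9, and the rule of keeping the earlier feature of a correlated pair are assumptions made for illustration, since the abstract specifies none of them. The data are synthetic.

```python
import numpy as np


def drop_correlated(X, threshold=0.9):
    """Return indices of columns kept after greedy Pearson-correlation pruning.

    Hypothetical helper: `threshold` is an assumed cutoff, and keeping the
    earlier column of a correlated pair is an assumed tie-breaking rule.
    Assumes no column is constant (a constant column has undefined correlation).
    """
    # Absolute Pearson correlation between every pair of columns (features).
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        # Keep feature j only if it is not highly correlated with any kept feature.
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept


# Synthetic data: column 1 is almost a scaled copy of column 0,
# column 2 is generated independently of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=200), b])

kept = drop_correlated(X)
print(kept)  # column 1 is pruned as redundant with column 0
```

In a pipeline like the one described, the columns would presumably be ordered by the relevance obtained in the first stage, so that the more informative feature of each correlated pair is the one retained; that ordering is also an assumption here.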
Keywords: arrhythmia; cancer dataset; disease classification; feature selection; filter method; whale optimization algorithm; wrapper method.
Conflict of interest statement
The authors declare no conflicts of interest.