BMC Bioinformatics. 2017 Jun 30;18(1):322.

doi: 10.1186/s12859-017-1729-2.

RGIFE: a ranked guided iterative feature elimination heuristic for the identification of biomarkers

Nicola Lazzarini¹, Jaume Bacardit²

Affiliations

¹ ICOS research group, School of Computing Science, Newcastle-upon-Tyne, UK.
² ICOS research group, School of Computing Science, Newcastle-upon-Tyne, UK. jaume.bacardit@newcastle.ac.uk.

PMID: 28666416
PMCID: PMC5493069
DOI: 10.1186/s12859-017-1729-2

Nicola Lazzarini et al. BMC Bioinformatics. 2017.

Abstract

Background: Current -omics technologies are able to sense the state of a biological sample in a very wide variety of ways. Given the high dimensionality that typically characterises these data, relevant knowledge is often hidden and hard to identify. Machine learning methods, and particularly feature selection algorithms, have proven very effective over the years at identifying small but relevant subsets of variables from a variety of application domains, including -omics data. Many methods exist, with varying trade-offs between the size of the identified variable subsets and the predictive power of such subsets. In this paper we focus on a heuristic for the identification of biomarkers called RGIFE: Rank Guided Iterative Feature Elimination. RGIFE is guided in its biomarker identification process by the information extracted from machine learning models and incorporates several mechanisms to ensure that it creates minimal and highly predictive feature sets.

Results: We compare RGIFE against five well-known feature selection algorithms using both synthetic and real (cancer-related transcriptomics) datasets. First, we assess the ability of the methods to identify relevant and highly predictive features. Then, using a prostate cancer dataset as a case study, we look at the biological relevance of the identified biomarkers.

Conclusions: We propose RGIFE, a heuristic for the inference of reduced panels of biomarkers that obtains similar predictive performance to widely adopted feature selection methods while selecting significantly fewer features. Furthermore, focusing on the case study, we show the higher biological relevance of the biomarkers selected by our approach. The RGIFE source code is available at http://ico2s.org/software/rgife.html

Keywords: Biomarkers; Feature reduction; Knowledge extraction; Machine learning.
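
The general idea of a rank-guided iterative elimination can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the URL above), and the abstract does not specify the model, the block size, or the stopping rule. Everything below is an assumption made for illustration: a nearest-centroid classifier scored by leave-one-out accuracy stands in for the machine learning model, features are ranked by the spread of the class centroids, `block_frac` is a hypothetical parameter, and the soft-fail mechanism mentioned in the Fig. 4 caption is omitted.

```python
import random
from statistics import mean


def centroids(X, y, feats):
    """Per-class mean of every kept feature; this is the stand-in 'model'."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = {f: mean(r[f] for r in rows) for f in feats}
    return model


def accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on `feats`."""
    hits = 0
    for i, (x, target) in enumerate(zip(X, y)):
        model = centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:], feats)
        pred = min(model, key=lambda lab: sum((x[f] - model[lab][f]) ** 2 for f in feats))
        hits += pred == target
    return hits / len(X)


def rank_features(X, y, feats):
    """Order features from most to least important (spread of class centroids)."""
    model = centroids(X, y, feats)
    spread = {f: max(c[f] for c in model.values()) - min(c[f] for c in model.values())
              for f in feats}
    return sorted(feats, key=spread.get, reverse=True)


def rank_guided_elimination(X, y, block_frac=0.25):
    """Drop blocks of low-ranked features while accuracy stays at or above the reference."""
    feats = list(range(len(X[0])))
    ref_acc = accuracy(X, y, feats)  # performance of the reference iteration
    block = max(1, int(len(feats) * block_frac))  # assumed initial block size
    while block >= 1 and len(feats) > 1:
        ranked = rank_features(X, y, feats)
        removed = False
        # Try the bottom-ranked block first, then move up the ranking.
        for end in range(len(ranked), 0, -block):
            start = max(0, end - block)
            candidate = ranked[:start] + ranked[end:]
            if not candidate:
                continue
            acc = accuracy(X, y, candidate)
            if acc >= ref_acc:  # equal or better: accept as the new reference
                feats, ref_acc, removed = candidate, acc, True
                break
        if not removed:
            block //= 2  # no block could be dropped: retry with smaller blocks
    return sorted(feats), ref_acc


# Toy data: feature 0 separates the two classes, features 1-5 are uniform noise.
rng = random.Random(0)
y = [i % 2 for i in range(12)]
X = [[10.0 * label + rng.random()] + [rng.random() for _ in range(5)] for label in y]
selected, final_acc = rank_guided_elimination(X, y)
print(selected, final_acc)
```

On this toy data only the informative feature should survive. With real -omics data the ranking model, the validation scheme, and the acceptance rule would all need to be chosen with more care than in this sketch.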

Figures

**Fig. 1**
Distribution of the accuracies, calculated using a 10-fold cross-validation, for different RGIFE policies. Each subplot represents the performance, obtained with ten different datasets, assessed with four classifiers

**Fig. 2**
Comparison of the number of attributes selected by different RGIFE policies. For each dataset, the average number of attributes obtained from the 10-fold cross-validation is reported together with the standard deviation

**Fig. 3**
Comparison of the number of attributes selected by the RGIFE policies, CFS and the L1-based feature selection. For each dataset, the average number of attributes obtained from the 10-fold cross-validation is reported together with the standard deviation

**Fig. 4**
Result of each iteration of the iterative feature elimination process when applied to two datasets (Breast and AML) for three different runs of RGIFE. *Green* and *blue* indicate performance equal to or better than that of the reference iteration: *green* is used when the removed attributes were the bottom-ranked ones, *blue* otherwise. *Red* represents worse performance, and *yellow* marks the identification of a soft-fail. The last *non-grey square* indicates the last iteration of the RGIFE run

**Fig. 5**
Analysis, in a disease context, of the signatures selected by different methods. Two sources for the gene-disease associations were used. Each metric refers to the number of disease-associated genes available in the signatures

**Fig. 6**
Normalised genomic alteration percentages of the signatures inferred for the case study. The alterations refer to the samples available from eight prostate cancer related datasets. The *bottom-right plot* shows the average ranks across the datasets. Higher rank indicates higher average alterations. The abbreviations and the colors for the plots are defined in the legend of the central subplot

**Fig. 7**
Graphical visualisation of the enriched terms (KEGG pathways found by *ClueGO*) associated with the signature-induced network nodes. Edges represent the relationship between terms based on their shared genes. The size of the node indicates the enrichment significance, the color gradient is proportional to the genes associated with the term. Only terms enriched with p-value < 0.05 are shown
