BMC Bioinformatics. 2017 Jun 30;18(1):322.
doi: 10.1186/s12859-017-1729-2.

RGIFE: a ranked guided iterative feature elimination heuristic for the identification of biomarkers

Nicola Lazzarini et al. BMC Bioinformatics.

Abstract

Background: Current -omics technologies can sense the state of a biological sample in a wide variety of ways. Given the high dimensionality that typically characterises these data, relevant knowledge is often hidden and hard to identify. Machine learning methods, and particularly feature selection algorithms, have proven very effective over the years at identifying small but relevant subsets of variables in a variety of application domains, including -omics data. Many methods exist, with varying trade-offs between the size of the identified variable subsets and the predictive power of those subsets. In this paper we focus on a heuristic for the identification of biomarkers called RGIFE: Rank Guided Iterative Feature Elimination. RGIFE is guided in its biomarker identification process by the information extracted from machine learning models and incorporates several mechanisms to ensure that it creates minimal and highly predictive feature sets.

Results: We compare RGIFE against five well-known feature selection algorithms using both synthetic and real (cancer-related transcriptomics) datasets. First, we assess the ability of the methods to identify relevant and highly predictive features. Then, using a prostate cancer dataset as a case study, we look at the biological relevance of the identified biomarkers.

Conclusions: We propose RGIFE, a heuristic for the inference of reduced panels of biomarkers that obtains predictive performance similar to that of widely adopted feature selection methods while selecting significantly fewer features. Furthermore, focusing on the case study, we show the higher biological relevance of the biomarkers selected by our approach. The RGIFE source code is available at: http://ico2s.org/software/rgife.html.

Keywords: Biomarkers; Feature reduction; Knowledge extraction; Machine learning.
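
The general idea of a rank-guided iterative elimination loop, as outlined in the abstract, can be illustrated with a minimal sketch. This is not the RGIFE source code: the nearest-centroid classifier, the class-mean-gap ranking, the function names, and the parameters `block_ratio` and `max_fails` are all assumptions chosen to keep the example dependency-free, and the soft-fail mechanism mentioned in the Fig. 4 caption is omitted. The sketch ranks the features, tries to remove a block from the bottom of the ranking, keeps the removal only if cross-validated accuracy does not drop, and otherwise tries a block higher in the ranking or shrinks the block.

```python
import random
from statistics import mean


def centroid_accuracy(X, y, feats, k=5, seed=0):
    """k-fold cross-validated accuracy of a nearest-centroid classifier using only feats."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        centroids = {}
        for label in set(y[i] for i in train):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = [mean(r[f] for r in rows) for f in feats]
        for i in fold:
            pred = min(centroids, key=lambda c: sum(
                (X[i][f] - m) ** 2 for f, m in zip(feats, centroids[c])))
            correct += pred == y[i]
    return correct / len(X)


def rank_features(X, y, feats):
    """Order feats from most to least informative (assumption: binary labels 0/1)."""
    def gap(f):
        a = mean(r[f] for r, label in zip(X, y) if label == 0)
        b = mean(r[f] for r, label in zip(X, y) if label == 1)
        return abs(a - b)
    return sorted(feats, key=gap, reverse=True)


def iterative_elimination(X, y, block_ratio=0.25, max_fails=3):
    """Simplified rank-guided elimination; returns (selected features, accuracy)."""
    feats = rank_features(X, y, range(len(X[0])))
    reference = centroid_accuracy(X, y, feats)
    block = max(1, int(len(feats) * block_ratio))
    while len(feats) > 1:
        block = min(block, len(feats) - 1)  # always keep at least one feature
        accepted = False
        for attempt in range(max_fails):
            end = len(feats) - attempt * block  # move up the ranking after a failure
            start = end - block
            if start < 0:
                break
            candidate = feats[:start] + feats[end:]
            score = centroid_accuracy(X, y, candidate)
            if score >= reference:  # removal did not hurt: accept and re-rank
                feats, reference = rank_features(X, y, candidate), score
                accepted = True
                break
        if not accepted:
            if block == 1:
                break  # nothing can be removed without losing accuracy
            block = max(1, block // 2)
    return feats, reference


def make_toy_data(n=60, n_noise=8, seed=1):
    """Hypothetical data: features 0 and 1 carry signal, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        signal = [rng.gauss(3.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(signal + noise)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    selected, accuracy = iterative_elimination(X, y)
    print("selected features:", selected, "accuracy:", accuracy)
```

Because a removal is accepted only when accuracy is at least as good as the current reference, the returned accuracy in this sketch can never be lower than the accuracy obtained with all features. The published method may use different models, ranking criteria, and acceptance rules.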

Figures

Fig. 1
Distribution of the accuracies, calculated using a 10-fold cross-validation, for different RGIFE policies. Each subplot represents the performance, obtained with ten different datasets, assessed with four classifiers
Fig. 2
Comparison of the number of attributes selected by different RGIFE policies. For each dataset, the average number of attributes obtained from the 10-fold cross-validation is reported together with the standard deviation
Fig. 3
Comparison of the number of attributes selected by the RGIFE policies, CFS and the L1-based feature selection. For each dataset, the average number of attributes obtained from the 10-fold cross-validation is reported together with the standard deviation
Fig. 4
Result of each iteration of the iterative feature elimination process when applied to two datasets (Breast and AML) for 3 different runs of RGIFE. Green and blue indicate performance equal to or better than that of the reference iteration. Green is used when the removed attributes were the bottom ranked; otherwise blue is used. Red represents worse performance, and yellow shows the identification of a soft-fail. The last non-grey square indicates the last iteration of the RGIFE run
Fig. 5
Analysis in a disease context of the signatures selected by different methods. Two sources for the gene-disease associations were used. Each metric refers to the number of disease-associated genes available in the signatures
Fig. 6
Normalised genomic alteration percentages of the signatures inferred for the case study. The alterations refer to the samples available from eight prostate cancer related datasets. The bottom-right plot shows the average ranks across the datasets. Higher rank indicates higher average alterations. The abbreviations and the colors for the plots are defined in the legend of the central subplot
Fig. 7
Graphical visualisation of the enriched terms (KEGG pathways found by ClueGO) associated with the signature-induced network nodes. Edges represent the relationship between terms based on their shared genes. The size of the node indicates the enrichment significance; the color gradient is proportional to the genes associated with the term. Only terms enriched with p-value < 0.05 are shown
