Brief Bioinform. 2019 Mar 22;20(2):492-503.
doi: 10.1093/bib/bbx124.

Evaluation of variable selection methods for random forests and omics data sets

Frauke Degenhardt et al. Brief Bioinform.

Abstract

Machine learning methods, and in particular random forests, are promising approaches for prediction based on high-dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE). In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta. In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings.
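The abstract does not describe how Vita turns importance scores into a selection. As a rough illustration only, the sketch below assumes the commonly described idea of building a null distribution by mirroring the non-positive importance scores around zero and reading off an empirical p-value per variable. The function name `vita_style_pvalues`, the tie handling (`>=`) and the toy scores are assumptions for illustration, not taken from the paper.

```python
def vita_style_pvalues(importances):
    """Empirical p-values from a mirrored null distribution.

    Assumption: variables with importance <= 0 are treated as null, and
    the negative scores are mirrored around zero to form a symmetric
    null distribution. The p-value is the share of null values that are
    at least as large as the observed importance.
    """
    nonpos = [v for v in importances.values() if v <= 0]
    null = nonpos + [-v for v in nonpos if v < 0]
    if not null:
        raise ValueError("no non-positive importance scores to build a null from")
    return {
        name: sum(1 for z in null if z >= vi) / len(null)
        for name, vi in importances.items()
    }


# Hypothetical importance scores for five variables.
scores = {"a": 0.9, "b": -0.1, "c": 0.0, "d": -0.2, "e": 0.15}
pvals = vita_style_pvalues(scores)
selected = sorted(name for name, p in pvals.items() if p < 0.05)
```

A null distribution built this way needs many variables with non-positive importance to be informative, which is consistent with the abstract's remark that only Boruta can also be applied in low-dimensional settings.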

Keywords: feature selection; high dimensional data; machine learning; random forest; relevant variables.

Figures

Figure 1
Performance comparison in Simulation Study 1 based on simulations with true effects. Shown are FDRs versus sensitivity of the scenarios with a group size of 10 (A) and 50 (B) as well as RMSE versus stability for the scenarios with a group size of 10 (C) and 50 (D). Each subfigure displays the median as well as the interquartile range over all 50 replicates of each method using different plotting symbols and colors.
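For readers who want to reproduce the two axes of panels A and B on their own simulations, the following minimal sketch computes the false discovery rate (FDR) and sensitivity of one selected variable set against a known set of causal variables. It uses the standard definitions; the function name and the variable labels are hypothetical, and the paper's exact implementation may differ.

```python
def fdr_and_sensitivity(selected, causal):
    """Return (FDR, sensitivity) for one replicate.

    FDR         = falsely selected variables / all selected variables
    sensitivity = correctly selected variables / all causal variables
    Empty denominators are mapped to 0.0 (a convention chosen here).
    """
    selected, causal = set(selected), set(causal)
    true_pos = len(selected & causal)
    false_pos = len(selected - causal)
    fdr = false_pos / len(selected) if selected else 0.0
    sensitivity = true_pos / len(causal) if causal else 0.0
    return fdr, sensitivity


# Hypothetical replicate: 4 variables selected, 5 truly causal.
selected = {"v1", "v2", "v3", "x7"}
causal = {"v1", "v2", "v3", "v4", "v5"}
fdr, sens = fdr_and_sensitivity(selected, causal)
```

The figure reports the median and interquartile range of such values over 50 replicates, so in practice this function would be called once per replicate and the results summarized afterwards.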
Figure 2
Empirical power to select causal variables in Simulation Study 1. Shown is the empirical power of each causal variable in the simulation scenarios with group sizes of 10 with a total of 30 causal variables (A) and 50 with a total of 150 causal variables (B). Each of the variable selection approaches is given in a different color.
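A minimal sketch of how empirical power per causal variable could be computed, assuming it is the fraction of simulation replicates in which that variable was selected. The function name and the toy replicates are hypothetical.

```python
from collections import Counter


def empirical_power(selections, causal):
    """Fraction of replicates in which each causal variable was selected.

    selections: list of selected-variable collections, one per replicate.
    causal:     iterable of causal variable names to report on.
    """
    if not selections:
        raise ValueError("need at least one replicate")
    # Count each variable at most once per replicate.
    counts = Counter(v for sel in selections for v in set(sel))
    n = len(selections)
    return {v: counts[v] / n for v in causal}


# Four hypothetical replicates of one selection method.
replicates = [{"v1", "v2"}, {"v1"}, {"v1", "v3", "x9"}, {"v1", "v2"}]
power = empirical_power(replicates, ["v1", "v2", "v3"])
```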
Figure 3
Performance comparison in Simulation Study 1 based on the null model. Shown are RMSE versus number of falsely selected variables of the scenarios with outcome simulated independently of any predictor variables using group sizes of 10 (A) and 50 (B). Each subfigure displays the median as well as the interquartile range over all 50 replicates of each method using different plotting symbols and colors.
Figure 4
Performance comparison in Simulation Study 2. Shown are classification error versus stability (A) and empirical power depending on the absolute effect size (B). For classification error and stability the median as well as the interquartile range over all 50 pairs of replicates are displayed. For empirical power the median frequency per category of absolute effect size is given. Results for each method can be distinguished by plotting symbols and colors.
Figure 5
Performance comparison based on experimental data sets. Shown are classification error versus stability of the two experimental studies predicting sex (A) and estrogen receptor positive breast cancer (B). Each subfigure displays the median error and variable stability of the two different data sets that were analyzed for each research question using different plotting symbols and shades of gray. Note that a different definition of stability is used in subfigure (B), which is defined relative to the minimum and not the union of the two sets of selected variables.
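The caption distinguishes stability defined relative to the union of two selected sets from stability defined relative to the minimum. One plausible reading, sketched below, is that both divide the size of the overlap by a different denominator: the size of the union (the Jaccard index) or the size of the smaller set. The function names and gene labels are hypothetical, and the exact formulas used in the paper are not given in this section.

```python
def stability_union(a, b):
    """Overlap relative to the union of both selections (Jaccard index)."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty selections are scored 0.0 (a convention chosen here).
    return len(a & b) / len(union) if union else 0.0


def stability_minimum(a, b):
    """Overlap relative to the smaller of the two selections."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical selections from two data sets for the same question.
set_one = {"g1", "g2", "g3", "g4"}
set_two = {"g3", "g4", "g5"}
s_union = stability_union(set_one, set_two)
s_min = stability_minimum(set_one, set_two)
```

Under this reading the minimum-based value is never smaller than the union-based one, so stability values in subfigures (A) and (B) are not directly comparable.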
Figure 6
Run time comparison based on the classification of experimental data sets. Shown are run times (in hours) for each method as an average of the two data sets of each research question.
