Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Wouter G Touw¹, Jumamurat R Bayjanov, Lex Overmars, Lennart Backus, Jos Boekhorst, Michiel Wels, Sacha A F T van Hijum

PMID: 22786785
PMCID: PMC3659301
DOI: 10.1093/bib/bbs034

Wouter G Touw et al. Brief Bioinform. 2013 May.

Brief Bioinform. 2013 May;14(3):315-26.

doi: 10.1093/bib/bbs034. Epub 2012 Jul 10.

Affiliation

¹ Radboud University of Nijmegen, the Netherlands.

Abstract

In the Life Sciences, 'omics' data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows biological insights to be uncovered that can be experimentally validated or mechanistically modelled; that is, sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques train a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited to the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for a subset of patients who have a specific subtype of cancer, but not for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as the algorithm implicitly takes them into account when creating the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used, and that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.

Keywords: Random Forest; conditional relationships; local importance; proximity; variable importance; variable interaction.

Figures

**Figure 1:**
Training of an individual tree of an RFM. The tree is built based on a data matrix (shown within the ellipses). This matrix consists of samples (S1–S10; e.g. individuals) belonging to two classes (encircled crosses or encircled plus signs; e.g. healthy and ill) and measurements for each sample for different variables (V1–V5; e.g. SNPs). Dice: random selection. Dashed lines: randomly selected samples and variables. For each tree, a bootstrap set is created by sampling samples from the data set at random and with replacement until it contains as many samples as there are in the data set. On average, the bootstrap set contains about 63% of the distinct samples in the original data set. In this example, the bootstrap set contains seven unique samples (samples S3–S9; non-selected samples S1, S2 and S10 are faded). For every node (indicated as ellipses) a few variables are randomly selected (here three; the other two non-selected variables are shown faded; by default RF selects the square root of the total number of variables) and evaluated for their ability to split the data. The variable resulting in the largest decrease in impurity is chosen to define the splitting rule. In the case of the top node, this is V4, and for the second node on the left-hand side this is V2 (indicated with the black arrows). This process is repeated until the nodes are pure, i.e. they contain samples of a single class (encircled cross or plus signs); such nodes are the so-called leaves (indicated with round-edged boxes).
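
The sampling and splitting steps described in this caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of any RF package: variables are assumed to be binary (0/1), impurity is measured with the Gini index (the caption does not name an impurity measure), and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def mean_unique_fraction(n_samples, repeats, rng):
    """Average fraction of distinct samples that end up in a bootstrap set."""
    total = 0.0
    for _ in range(repeats):
        total += len(set(bootstrap_indices(n_samples, rng))) / n_samples
    return total / repeats


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels):
    """Decrease in Gini impurity obtained by splitting on one binary variable."""
    left = [y for v, y in zip(values, labels) if v == 0]
    right = [y for v, y in zip(values, labels) if v == 1]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def best_split(X, y, n_try, rng):
    """Evaluate n_try randomly chosen variables; return (best index, its decrease)."""
    candidates = rng.sample(range(len(X[0])), n_try)
    scored = [(impurity_decrease([row[j] for row in X], y), j) for j in candidates]
    decrease, j = max(scored)
    return j, decrease


if __name__ == "__main__":
    rng = random.Random(0)
    # Expected to be close to 1 - 1/e (about 0.632) for large data sets.
    print(mean_unique_fraction(1000, 200, rng))
```

The fraction of distinct samples approaches 1 - 1/e as the data set grows, which is where the figure of about 63% comes from. In `best_split`, passing the square root of the number of variables as `n_try` would mirror the default mentioned in the caption.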

**Figure 2:**
Concept visualization of how relations between variables and samples could be represented following the dissection of the trees in a random forest. In this hypothetical case, a supervised classification was performed on samples from two classes (encircled crosses or encircled plus signs; e.g. healthy individuals or patients). Dissection of the random forest trees might result in the further (unsupervised) distinction of subsets of samples. Top panel: variables (V1–Vn; e.g. SNPs in a GWAS), their values (1 or 0) and interactions. Bottom panel: subsets (separated by the dashed lines) of samples from the pure classes that are predicted by a given interaction between variables. An interpretation example: provided that SNP4 (V4) is present, SNP2 (V2) allows the distinction between two subsets (consisting of healthy individuals 6, 7, 8 and 9 and patients 2, 5 and s). If SNP4 is absent, then the patient samples 1, 3, 4 and t can be classified. If SNP1 (V1) is absent and SNP5 (V5) is present, a subset of healthy individuals consisting of samples a, b, c and d can be classified. Note that in this example, apparently no subset can be distinguished if SNP1 (V1) is present or SNP5 (V5) is absent.
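
One way to picture the kind of dissection this caption describes is to group samples by the root-to-leaf path they follow in a single tree, so that each group is defined by a conjunction of variable conditions. The sketch below is purely illustrative: the tree, the sample names and the genotypes are hypothetical, and the caption does not state which value of V2 leads to which class, so that direction is an arbitrary assumption.

```python
from collections import defaultdict

# Hypothetical tree. Internal node: (variable, subtree_if_0, subtree_if_1).
# Leaf: a class label given as a string.
TREE = (
    "V4",
    "patient",                      # V4 absent
    ("V2", "healthy", "patient"),   # V4 present: V2 decides (direction assumed)
)

# Hypothetical samples with made-up binary values for V2 and V4.
SAMPLES = {
    "p1": {"V4": 0, "V2": 0},
    "p2": {"V4": 0, "V2": 1},
    "p3": {"V4": 1, "V2": 1},
    "h1": {"V4": 1, "V2": 0},
    "h2": {"V4": 1, "V2": 0},
}


def leaf_rule(tree, sample):
    """Follow one sample down the tree; return (path conditions, predicted class)."""
    conditions = []
    node = tree
    while not isinstance(node, str):
        variable, if_absent, if_present = node
        value = sample[variable]
        conditions.append((variable, value))
        node = if_present if value == 1 else if_absent
    return tuple(conditions), node


def group_by_rule(tree, samples):
    """Group sample names by the rule (conditions and class) that classifies them."""
    groups = defaultdict(list)
    for name, sample in samples.items():
        groups[leaf_rule(tree, sample)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for (conditions, label), names in group_by_rule(TREE, SAMPLES).items():
        rule = " and ".join(f"{var}={val}" for var, val in conditions)
        print(f"{rule} -> {label}: {names}")
```

In a real forest the same bookkeeping would have to be repeated over many trees and summarised, which this sketch does not attempt.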

Comment in

Letter to the Editor: On the term 'interaction' and related phrases in the literature on Random Forests.
Boulesteix AL, Janitza S, Hapfelmeier A, Van Steen K, Strobl C. Brief Bioinform. 2015 Mar;16(2):338-45. doi: 10.1093/bib/bbu012. Epub 2014 Apr 9. PMID: 24723569.
