Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 May;14(3):315-26.
doi: 10.1093/bib/bbs034. Epub 2012 Jul 10.

Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Affiliations

Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Wouter G Touw et al. Brief Bioinform. 2013 May.

Abstract

In the Life Sciences 'omics' data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high-prediction accuracy and provide information on importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example: within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF as these are implicitly taken into account by the algorithm during the creation of the classification model. This review details some of the to the best of our knowledge rarely or never used RF properties that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.

Keywords: Random Forest; conditional relationships; local importance; proximity; variable importance; variable interaction.

PubMed Disclaimer

Figures

Figure 1:
Figure 1:
Training of an individual tree of an RFM. The tree is built based on a data matrix (shown within the ellipses). This matrix consists of samples (S1–S10; e.g. individuals) belonging to two classes (encircled crosses or encircled plus signs; e.g. healthy and ill) and measurements for each sample for different variables (V1-V5; e.g. SNPs). Dice: random selection. Dashed lines: randomly selected samples and variables. For each tree, a bootstrap set is created by sampling samples from the data set at random and with replacement until it contains as many samples as there are in the data set. The random selection will contain about 63% of the samples in the original data set. In this example, the bootstrap set contains seven unique samples (samples S3–S9; non-selected samples S1, S2 and S10 are faded). For every node (indicated as ellipses) a few variables are randomly selected (here three; the other two non-selected variables are shown faded; by default RF selects the square root of the total number of variables) and evaluated for their ability to split the data. The variable resulting in the largest decrease in impurity is chosen to define the splitting rule. In case of the top node, this is V4 and for the second node on the left hand side this is V2 (indicated with the black arrows). This process is repeated until the nodes are pure (so called leaves; indicated with round-edged boxes): they contain samples of the same class (encircled cross or plus signs).
Figure 2:
Figure 2:
Concept visualization of how relations between variables and samples could be represented following the dissection of the trees in a random forest. In this hypothetical case, a supervised classification was performed on samples from two classes (encircled crosses or encircled plus signs; e.g. healthy individuals or patients). Dissection of the random forest trees might result in the further (unsupervised) distinction of subsets of samples. Top panel: variables (V1-Vn; e.g. SNPs in a GWAS study), their values (1 or 0) and interactions. Bottom panel: subsets (separated by the dashed lines) of samples from the pure classes that are predicted by a given interaction between variables. An interpretation example: provided that SNP4 (V4) is present, SNP2 (V2) allows the distinction between two subsets (consisting of healthy individuals 6, 7 8, 9 and patients 2, 5 and s). If SNP4 is absent, then the patient samples 1, 3, 4 and t can be classified. In case SNP1 (V1) is absent and SNP5 (V5) is present, a subset of healthy individuals consisting of samples a, b, c and d can be classified. Note that in this example, there can apparently no subset be distinguished if SNP1 (V1) is present or SNP5 (V5) is absent.

Comment in

References

    1. Ideker T, Galitski T, Hood L. A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet. 2001;2:343–72. - PubMed
    1. Kitano H. Systems biology: a brief overview. Science. 2002;295:1662–4. - PubMed
    1. Chuang H-Y, Hofree M, Ideker T. A decade of systems biology. Annu Rev Cell Dev Biol. 2010;26:721–44. - PMC - PubMed
    1. Ghosh S, Matsuoka Y, Asai Y, et al. Software for systems biology: from tools to integrated platforms. Nat Rev Genet. 2011;12:821–32. - PubMed
    1. Gehlenborg N, O’Donoghue SI, Baliga NS, et al. Visualization of omics data for systems biology. Nat Methods. 2010;7:S56–68. - PubMed