Convergent Random Forest predictor: methodology for predicting drug response from genome-scale data applied to anti-TNF response

Jadwiga R Bienkowska¹, Gul S Dalgin, Franak Batliwalla, Normand Allaire, Ronenn Roubenoff, Peter K Gregersen, John P Carulli

PMID: 19699293
PMCID: PMC4476397
DOI: 10.1016/j.ygeno.2009.08.008

Comparative Study

Jadwiga R Bienkowska et al. Genomics. 2009 Dec;94(6):423-32. doi: 10.1016/j.ygeno.2009.08.008. Epub 2009 Aug 20.

Affiliation

¹ Biogen IDEC, 14 Cambridge Ctr, Cambridge, MA 02142, USA. Jadwiga.Bienkowska@biogenidec.com

Abstract

Biomarker development for prediction of patient response to therapy is one of the goals of molecular profiling of human tissues. Due to the large number of transcripts, the relatively limited number of samples, and the high variability of the data, identification of predictive biomarkers is a challenge for data analysis. Furthermore, many genes may be responsible for drug response differences, but often only a few are sufficient for accurate prediction. Here we present an analysis approach, the Convergent Random Forest (CRF) method, for the identification of highly predictive biomarkers. The aim is to select from genome-wide expression data a small number of non-redundant biomarkers that could be developed into a simple and robust diagnostic tool. Our method combines the Random Forest classifier and gene expression clustering to rank and select a small number of predictive genes. We evaluated the CRF approach by analyzing four different data sets. The first set contains transcript profiles of whole blood from rheumatoid arthritis patients, collected before anti-TNF treatment, together with the patients' subsequent response to the therapy. In this set, CRF identified 8 transcripts predicting response to therapy with 89% accuracy. We also applied the CRF to the analysis of three previously published expression data sets. For all sets, we compared the CRF and recursive support vector machines (RSVM) approaches to feature selection and classification. In all cases the CRF selects a much smaller number of features, five to eight genes, while achieving similar or better performance on both training and independent testing sets of data. For both methods, performance estimated by cross-validation is similar to performance on independent samples. The method has been implemented in R and is available from the authors upon request: Jadwiga.Bienkowska@biogenidec.com.
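
The abstract does not spell out how the Random Forest ranking and the expression clustering are combined. One plausible reading, suggested by the Figure 2 caption ("k-best genes selected from k clusters"), is to keep the most important gene from each cluster so that the selected genes are non-redundant. The Python sketch below illustrates only that idea and is not the authors' R implementation: the function name `best_gene_per_cluster`, the input dictionaries, and the toy scores are all assumptions made for illustration.

```python
from collections import defaultdict


def best_gene_per_cluster(importance, cluster_of):
    """Pick the highest-importance gene from each expression cluster.

    importance: dict gene -> importance score (assumed to come from a
                Random Forest run; hypothetical input)
    cluster_of: dict gene -> cluster label (assumed to come from an
                expression clustering step; hypothetical input)
    """
    by_cluster = defaultdict(list)
    for gene, score in importance.items():
        by_cluster[cluster_of[gene]].append((score, gene))
    # Keep the top-scoring gene in every cluster, then order the picks by score.
    picks = [max(members) for members in by_cluster.values()]
    picks.sort(reverse=True)
    return [gene for _, gene in picks]


# Toy values, invented purely for illustration.
importance = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.3, "g5": 0.1}
cluster_of = {"g1": "A", "g2": "A", "g3": "B", "g4": "C", "g5": "C"}
selected = best_gene_per_cluster(importance, cluster_of)
print(selected)
```

In this toy input, ranking by importance alone would take both g1 and g2 among the top three even though they fall in the same cluster, whereas the per-cluster pick keeps one representative from each cluster.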

Figures

**Figure 1**
Error rate distributions at *mtry* = {45} (10 runs each) for three gene groups: (a-45) all 166 genes; (f-45) the final 40 genes that converged at *mtry* = {45}; (imp-45) the first 40 genes selected by importance among the 166.

**Figure 2**
Change in OOB error rate with the number of genes selected by two methods, importance ranking and clustering ranking, in one example run of the RF. The black line corresponds to genes selected by importance ranking of the initial 166-gene set. Blue circles represent the 40 convergent genes ranked by the importance measure. Minimum error (11%) is obtained with k = 24 genes, circled in red. Red circles correspond to the error rate with the k best genes selected from k clusters (x axis). Minimum error (11%) is obtained with k = 8 genes, circled in red.

**Figure 3**
Number of genes (y axis) with minimum error (x axis) obtained by importance ranking (red) and clustering (blue) in 50 separate random forest runs (z axis).
