Bioinformatics. 2011 Jan 15;27(2):220-4.
doi: 10.1093/bioinformatics/btq628. Epub 2010 Dec 5.

Predicting in vitro drug sensitivity using Random Forests

Gregory Riddick et al. Bioinformatics.

Abstract

Motivation: Panels of cell lines such as the NCI-60 have long been used to test drug candidates for their ability to inhibit proliferation. Predictive models of in vitro drug sensitivity have previously been constructed using gene expression signatures generated from gene expression microarrays. These statistical models allow the prediction of drug response for cell lines not in the original NCI-60. We improve on existing techniques by developing a novel multistep algorithm that builds regression models of drug response using Random Forest, an ensemble approach based on classification and regression trees (CART).

Results: This method successfully predicted drug response for a panel of 19 breast cancer cell lines and a panel of 7 glioma cell lines, outperformed other methods based on differential gene expression, and has general utility for any application that seeks to relate gene expression data to a continuous output variable.

Implementation: Software was written in the R language and will be available together with associated gene expression and drug response data as the package ivDrug at http://r-forge.r-project.org.

Figures

Fig. 1.
Overview of the model building algorithm. (A) A Random Forest model is fit between all the probesets in the training set (16 644) and the IC50 values for each drug. (B) Probesets whose variable importance is more than 2 SDs above the mean variable importance of all probesets are kept as a gene expression signature; a second Random Forest model is fit between this gene expression signature and the IC50 values for each drug. (C) Case proximity values for each drug are generated from the second model using Equation (1), outlying cell lines are removed, and a third Random Forest model is fit with the remaining cell lines and the gene expression signature.
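The selection rule in step (B) can be illustrated with a short Python sketch. This is not the authors' R implementation: the probeset names and importance scores are invented for illustration, and using the population standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, pstdev


def select_signature(importance, n_sd=2.0):
    """Return the probesets whose importance exceeds mean + n_sd * SD.

    `importance` maps a probeset name to its variable importance score.
    The population SD (pstdev) is an assumption, not taken from the paper.
    """
    scores = list(importance.values())
    cutoff = mean(scores) + n_sd * pstdev(scores)
    return [name for name, score in importance.items() if score > cutoff]


# Hypothetical scores: 19 uninformative probesets and one strong one.
importance = {f"probe_{i}": 0.01 for i in range(19)}
importance["probe_hit"] = 1.0

signature = select_signature(importance)
print(signature)
```

Under this rule only probesets far above the bulk of the importance distribution survive, so the signature is typically a small fraction of the input probesets.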
Fig. 2.
Pairwise proximity matrices for pepleomycin and simvastatin. Proximity matrices from Random Forest are defined as the number of instances in which two cases (cell lines) are assigned to the same terminal node of a tree, normalized over the [0, 1] interval. Proximity between a case and itself is not a meaningful value, so these instances on the diagonal are set to zero. (A1) Proximity matrix for pepleomycin before reduction of cell lines by Equation (1). (A2) Proximity matrix for pepleomycin after removal of outlying cell lines. (B1) Proximity matrix for simvastatin. (B2) Proximity matrix for simvastatin after removal of outlying cell lines.
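The proximity definition in this caption can be sketched in a few lines of Python. The terminal-node assignments below are hypothetical, and dividing the counts by the number of trees is an assumed way of normalizing to [0, 1]. The outlier-removal step is not shown because Equation (1) is not reproduced here.

```python
def proximity_matrix(leaf_ids):
    """Compute pairwise proximities from terminal-node assignments.

    leaf_ids[t][i] is the terminal node that tree t assigns to case i.
    """
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for tree in leaf_ids:
        for i in range(n_cases):
            # Start at i + 1 so the diagonal is never counted and stays zero.
            for j in range(i + 1, n_cases):
                if tree[i] == tree[j]:
                    prox[i][j] += 1.0
                    prox[j][i] += 1.0
    # Assumed normalization: divide co-occurrence counts by the number of trees.
    for i in range(n_cases):
        for j in range(n_cases):
            prox[i][j] /= n_trees
    return prox


# Hypothetical assignments: 4 trees (rows), 3 cell lines (columns).
leaf_ids = [
    [1, 1, 2],
    [3, 3, 3],
    [5, 6, 6],
    [7, 7, 8],
]
prox = proximity_matrix(leaf_ids)
for row in prox:
    print(row)
```

In this toy input the first two cell lines share a terminal node in three of four trees, so their proximity is 0.75.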
Fig. 3.
Experimental confirmation of predictions for simvastatin and pepleomycin in 19 breast cancer cell lines. Cell lines were identified as resistant if showing −log(IC50) < 4 and as sensitive if showing −log(IC50) > 5.4, for both simvastatin (A) and pepleomycin (B). The y-axis shows the predicted IC50. Sensitive and resistant groups for the two-step method showed statistically significant differences in means using a two-tailed t-test (P < 0.05). The two-step method produced a greater separation of means (0.17, 0.40) versus (0.20, 0.30) for simvastatin and (0.28, 0.51) versus (0.31, 0.43) for pepleomycin.
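A minimal Python sketch of the grouping described in this caption follows. The thresholds come from the caption. The "intermediate" label for cell lines falling between the two thresholds, the function names, and all measured and predicted values are assumptions made for illustration.

```python
def classify_response(neg_log_ic50, resistant_below=4.0, sensitive_above=5.4):
    """Label a cell line from its measured -log(IC50) value."""
    if neg_log_ic50 < resistant_below:
        return "resistant"
    if neg_log_ic50 > sensitive_above:
        return "sensitive"
    return "intermediate"  # assumed label; the caption does not name this group


def group_means(measured, predicted):
    """Mean predicted IC50 per response group, grouped by the measured values."""
    groups = {}
    for m, p in zip(measured, predicted):
        groups.setdefault(classify_response(m), []).append(p)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


# Hypothetical data for five cell lines (not from the paper).
measured = [3.5, 3.8, 4.6, 5.6, 6.0]        # measured -log(IC50)
predicted = [0.40, 0.36, 0.30, 0.20, 0.16]  # predicted IC50, scaled to [0, 1]

means = group_means(measured, predicted)
print(means)
```

The gap between the resistant and sensitive group means is the quantity the caption calls the separation of means.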
Fig. 4.
Experimental confirmation of predictions for 40 FDA-approved cancer drugs in seven glioma cell lines. For each drug, the mean of predicted IC50 response over the seven cell lines was computed. The percent viability of cell lines relative to a control was measured at 50 and 500 nM of drug concentration after 48 h of growth and normalized over the [0, 1] interval. A two-tailed significance test (correlation test in R) of the Pearson product moment correlation between predicted and measured IC50 values across all 37 cell lines showed significance for both concentration points at P < 0.001.
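The normalization and correlation steps in this caption can be sketched in Python as follows. This is not the authors' R code, and the data are invented. The t statistic shown is the standard one for a Pearson correlation with n − 2 degrees of freedom; the P value itself is not computed, because that needs a t-distribution CDF that the Python standard library does not provide.

```python
from math import sqrt


def min_max(values):
    """Rescale values to the [0, 1] interval (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Hypothetical per-drug values (not from the paper).
predicted = [0.2, 0.4, 0.5, 0.9]         # mean predicted response per drug
viability = [30.0, 55.0, 50.0, 95.0]     # percent viability relative to control

scaled = min_max(viability)
r = pearson_r(predicted, scaled)
t = t_statistic(r, len(predicted))
print(round(r, 3), round(t, 2))
```

Min-max scaling is a linear transformation, so it leaves the Pearson correlation unchanged; it only puts the measured values on the same scale as the predictions for plotting.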
Fig. 5.
Performance evaluation of the two-step algorithm. (A) The two-step method successfully created 37 signatures from the 40 FDA-approved drugs, while signature generation based on differential gene expression produced 17 and differential gene expression + co-expression extrapolation produced 14. (B) Scatter plot of two-step algorithm predictions for 37 drugs versus measured IC50 values. (C) Scatter plot of two-step algorithm predictions for 14 drugs. (D) Scatter plot of predicted versus actual IC50 values for the same 14 drugs predicted using the co-expression extrapolation method.
