Annu Rev Biomed Data Sci. 2018 Jul:1:1-27.
doi: 10.1146/annurev-biodatasci-080917-013350. Epub 2018 Apr 25.

Big Data Approaches for Modeling Response and Resistance to Cancer Drugs

Peng Jiang et al. Annu Rev Biomed Data Sci. 2018 Jul.

Abstract

Despite significant progress in cancer research, current standard-of-care drugs fail to cure many types of cancers. Hence, there is an urgent need to identify better predictive biomarkers and treatment regimes. Conventionally, insights from hypothesis-driven studies are the primary force for cancer biology and therapeutic discoveries. Recently, the rapid growth of big data resources, catalyzed by breakthroughs in high-throughput technologies, has resulted in a paradigm shift in cancer therapeutic research. The combination of computational methods and genomics data has led to several successful clinical applications. In this review, we focus on recent advances in data-driven methods to model anticancer drug efficacy, and we present the challenges and opportunities for data science in cancer therapeutic research.

Keywords: big data; combination therapy; drug resistance; immunotherapy; precision medicine; response biomarker; toxicity.

Figures

Figure 1
Data-driven approaches for modeling cancer therapy efficacy. Most data-driven studies of anticancer drug efficacy involve four components: genomics technology, experimental model, computational method, and clinical application. The use of genomics technology in experimental models generates data that can be analyzed by computational methods to generate results for clinical applications. (a) Microarray and high-throughput sequencing are widely used to study the DNA alterations and RNA transcriptomes in cancer samples. Genetic screens through RNAi or CRISPR technologies can study the effect of perturbing a gene in a cell line model (174). Compound screens based on automation frameworks can test the efficacy of many drugs on a cell line panel (29, 35, 36). (b) The most clinically relevant system is the human, where both the tumor microenvironment (10, 12) and the gut microbiota (17) can determine anticancer drug efficacy. However, genetic experiments cannot be directly applied to humans, so mouse models are used as alternatives to study in vivo factors of drug response (43, 175, 176). Cancer cell lines are the most widely used research models. Cell lines can be cultured alone or cocultured, either as cancer cells with immune cells (–48) or as immune cells with bacterial cells (64, 69). (c) Most data analyses involve variable selection. Molecular alterations of genes across samples are input variables, and drug efficacy is the outcome (84). Variable selection methods can identify critical genes associated with anticancer drug efficacy. Clustering algorithms can be applied to identify patterns in a data set (115). Mathematical (97, 100) or network models (107) can be applied to explore the properties and mechanisms of a molecular circuit that mediates anticancer drug efficacy. (d) Many studies are designed to find biomarkers for therapy response prediction (177) or side effects (–136) in clinical applications using the molecular profiles of patient samples. Data-driven models can also be applied to identify synergistic drug combinations to treat specific cancers (84). Abbreviations: CRISPR, clustered regularly interspaced short palindromic repeats; NK, natural killer; MDSC, myeloid-derived suppressor cell; MΦ, macrophage; oligo, oligonucleotide; RNAi, RNA interference.
Figure 2
Compound screening in cancer cell lines. Automation frameworks can be utilized to test the growth inhibition effects of a library of compounds across many cancer cell lines with diverse genetic backgrounds. Most compound screen projects also profiled the molecular features (e.g., gene expression, copy number, mutation status) of cell lines. The final data output is the growth inhibition effects of compounds on cell lines, together with cell line molecular profiles.
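
The data layout described in this caption can be pictured as two matrices that share a cell-line axis. The Python snippet below is only a minimal sketch of that layout: the compound, gene, and cell-line names are hypothetical, the values are random placeholders rather than measurements, and the helper `paired_view` is introduced purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels, for illustration only.
cell_lines = ["line_A", "line_B", "line_C", "line_D"]
compounds = ["compound_1", "compound_2", "compound_3"]
genes = ["gene_x", "gene_y"]

# Growth inhibition: one row per compound, one column per cell line.
# Values are random placeholders in [0, 1], not measurements.
inhibition = rng.uniform(0.0, 1.0, size=(len(compounds), len(cell_lines)))

# Molecular profile: one row per cell line, one column per feature
# (here a placeholder expression value per gene).
expression = rng.normal(size=(len(cell_lines), len(genes)))


def paired_view(compound, gene):
    """Return (cell line, inhibition, feature) triples for one compound and
    one gene, aligned on the shared cell-line axis."""
    i = compounds.index(compound)
    j = genes.index(gene)
    return list(zip(cell_lines, inhibition[i], expression[:, j]))


for name, effect, feature in paired_view("compound_2", "gene_x"):
    print(f"{name}: inhibition={effect:.2f}, expression={feature:.2f}")
```

Keeping both matrices indexed by the same ordered list of cell lines is what allows a compound's effect to be compared directly with any molecular feature.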
Figure 3
Variable selection in high-dimensional data. (a) Three common relationships between variable matrices (X) and outcomes (Y). (b) The unified framework of linear models y ~ g(Xβ) for n samples and p variables (with p > n), where the variable matrix X has dimensions n × p and the coefficient vector β has length p. The number of samples n may range from 10 to 1,000 in most studies, representing the number of profiled patients. The number of variables p is about 20,000 in most studies, representing the number of human genes. (c) High-dimensional regression through regularization. The coefficients of most high-dimensional regressions can be solved under a unified framework of minimizing the objective function f(β) together with a combination of L1 (LASSO) and L2 (ridge) penalties (where λ1, λ2 ≥ 0). The objective function of linear regression is the sum of squared residuals across all samples. The objective functions of logistic and Cox-PH regressions are the negative log of the likelihood function L(β, y, X). (d) High-dimensional regression through stepwise forward selection. At each step, the best variable is selected from a candidate pool to minimize the model error, such as the cross-validation error. The procedure terminates when any further variable selection would increase the model error. Some previously selected variables may become insignificant during the stepwise process and be removed from the model. Abbreviations: Cox-PH, Cox proportional hazard; LASSO, least absolute shrinkage and selection operator.
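
The stepwise forward selection in panel d can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure of any particular study: it assumes ordinary least squares as the model, K-fold cross-validated mean squared error as the model error, and synthetic data in which only variables 0 and 5 affect the outcome. The function names `cv_error` and `forward_select` are hypothetical, and the removal of previously selected variables mentioned in the caption is omitted.

```python
import numpy as np


def cv_error(X, y, cols, n_folds=5):
    """Cross-validated mean squared error of least squares on columns `cols`.
    Folds are contiguous blocks; rows are assumed to be in random order."""
    n = len(y)
    errors = []
    for test_idx in np.array_split(np.arange(n), n_folds):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Design matrices: an intercept column plus the selected variables.
        A_train = np.column_stack([np.ones(train.sum()), X[train][:, cols]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx][:, cols]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[test_idx] - A_test @ beta) ** 2))
    return float(np.mean(errors))


def forward_select(X, y, max_vars=10, n_folds=5):
    """Greedily add the variable that most lowers the cross-validation
    error; stop when no candidate lowers it or max_vars is reached."""
    selected = []
    best_err = cv_error(X, y, selected, n_folds)  # intercept-only baseline
    while len(selected) < max_vars:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        errs = [cv_error(X, y, selected + [j], n_folds) for j in candidates]
        k = int(np.argmin(errs))
        if errs[k] >= best_err:
            break  # every remaining variable would not reduce the error
        selected.append(candidates[k])
        best_err = errs[k]
    return selected, best_err


# Synthetic example: 60 samples, 30 variables, two of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=60)

selected, err = forward_select(X, y)
print("selected variables:", selected)
print("cross-validation error:", round(err, 4))
```

The cap `max_vars` keeps the number of fitted coefficients below the number of training samples, which matters in the p > n setting the caption describes.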
Figure 4
Biomarker training using clinical and cell line data. (a) The training of a multigene biomarker to guide treatment decisions starts from a collection of tumor genomics profiles paired with the patients’ clinical outcomes. The association between gene profiles and patients’ clinical outcomes is tested by statistical models, and a subset of genes is selected through a cross-validation procedure to optimize prediction accuracy. The accuracy of the gene biomarker is then evaluated in clinical trials for Food and Drug Administration approval or commercialization. (b) Computational methods can identify response biomarkers from compound screen data. Statistical methods can identify genes whose molecular status is significantly associated with drug efficacy across screened cell lines. The identified biomarker could be a subset of genes or a genome-wide vector of scores with one value per gene. In the latter case, the therapy response of each patient could be predicted by correlating the tumor's gene expression values with the biomarker scores.
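
The genome-wide score approach in panel b can be illustrated with a short Python sketch. The caption does not specify the statistic, so the example assumes Pearson correlation both for deriving the per-gene scores and for scoring a tumor. The data are synthetic, with the first 10 genes constructed to drive drug efficacy, and the function names are hypothetical.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def biomarker_scores(expression, efficacy):
    """One score per gene: correlation between that gene's expression and
    drug efficacy across cell lines.
    expression: (n_cell_lines, n_genes); efficacy: (n_cell_lines,)."""
    return np.array([pearson(expression[:, g], efficacy)
                     for g in range(expression.shape[1])])


def predict_response(tumor_expression, scores):
    """Predicted response of one patient: correlation between the tumor's
    expression vector and the genome-wide score vector."""
    return pearson(tumor_expression, scores)


# Synthetic screen: 200 cell lines, 100 genes; efficacy is driven by the
# first 10 genes plus noise (an assumption made for this example).
rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 100))
efficacy = expression[:, :10].sum(axis=1) + 0.5 * rng.normal(size=200)
scores = biomarker_scores(expression, efficacy)

# Two idealized tumors: driver genes high versus driver genes low.
sensitive_tumor = np.zeros(100)
sensitive_tumor[:10] = 1.0
resistant_tumor = -sensitive_tumor

print("sensitive tumor score:", round(predict_response(sensitive_tumor, scores), 3))
print("resistant tumor score:", round(predict_response(resistant_tumor, scores), 3))
```

A positive correlation suggests that the tumor's expression pattern resembles that of responsive cell lines, and a negative one suggests the opposite. Any decision threshold would have to be calibrated on data with known outcomes.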

References

    1. Huang ME, Ye YC, Chen SR, Chai JR, Lu JX, et al. 1988. Use of all-trans retinoic acid in the treatment of acute promyelocytic leukemia. Blood 72:567–72
    2. Deininger M, Buchdunger E, Druker BJ. 2005. The development of imatinib as a therapeutic agent for chronic myeloid leukemia. Blood 105:2640–53
    3. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, et al. 2004. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 304:1497–500
    4. Solomon BJ, Mok T, Kim DW, Wu YL, Nakagawa K, et al. 2014. First-line crizotinib versus chemotherapy in ALK-positive lung cancer. New Engl. J. Med. 371:2167–77
    5. Holohan C, Van Schaeybroeck S, Longley DB, Johnston PG. 2013. Cancer drug resistance: an evolving paradigm. Nat. Rev. Cancer 13:714–26
