Lean Big Data integration in systems biology and systems pharmacology

Avi Ma'ayan¹, Andrew D Rouillard², Neil R Clark², Zichen Wang², Qiaonan Duan², Yan Kou²

Affiliations

¹ Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, Systems Biology Center New York (SBCNY), One Gustave L. Levy Place, Box 1215, New York, NY 10029, USA. Electronic address: avi.maayan@mssm.edu.
² Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, Systems Biology Center New York (SBCNY), One Gustave L. Levy Place, Box 1215, New York, NY 10029, USA.

PMID: 25109570
PMCID: PMC4153537
DOI: 10.1016/j.tips.2014.07.001

Trends Pharmacol Sci. 2014 Sep;35(9):450-60. doi: 10.1016/j.tips.2014.07.001. Epub 2014 Aug 7.

Abstract

Data sets from recent large-scale projects can be integrated into one unified puzzle that can provide new insights into how drugs and genetic perturbations applied to human cells are linked to whole-organism phenotypes. Data that report how drugs affect the phenotype of human cell lines and how drugs induce changes in gene and protein expression in human cell lines can be combined with knowledge about human disease, side effects induced by drugs, and mouse phenotypes. Such data integration efforts can be achieved through the conversion of data from the various resources into single-node-type networks, gene-set libraries, or multipartite graphs. This approach can help identify more relationships between genes, drugs, and phenotypes, and can be used to benchmark computational and experimental methods. Overall, this lean 'Big Data' integration strategy will bring us closer to the goal of realizing personalized medicine.

Keywords: data integration; network analysis; network pharmacology; side-effect prediction; systems pharmacology; target prediction.

Figures

**Fig. 1**
Over 40 large-scale projects, databases, and other resources can be integrated into a coherent map that links human and mouse phenotypes to drugs, genes, expression signatures, protein-protein interaction networks, and gene-gene functional association networks. This map is helpful for assembling the puzzle pieces needed to significantly enhance our understanding of the effects of drugs on the individual human phenotype at the molecular level. The diagram brings various datasets and resources together for the purpose of identifying non-trivial associations and relationships, enabling discovery and assisting in converting data to knowledge.

**Fig. 2**
Eighty-eight drugs were applied at four doses (0.08 μM, 0.4 μM, 2 μM, and 10 μM) to the MCF10A breast cancer cell line, and gene expression changes were then measured at two time points (6 and 24 hours), resulting in 352 experiments in total. The gene expression values of 978 landmark genes responding to these treatment conditions and control experiments were measured using the L1000 platform. Characteristic direction analysis (33) was applied to the normalized gene expression values to calculate a gene expression signature for each experiment. A gene expression signature is a vector in the 978-dimensional space spanned by the 978 genes; its direction represents the way in which the experiment deviates from control, and its norm represents the strength of the experiment. The strengths were compared to a null distribution so that each gene expression signature was assigned a z-score quantifying the strength of the perturbation. Only highly significant experiments, those with z-scores greater than 5, were retained for visualization. The gene expression vectors of these experiments were projected onto the first three principal components using PCA and then plotted. The directions of the perturbation vectors fall into four groups (I–IV). Examining the labels of these experiments, we observe a time dependency: experiments from different time points do not fall into the same group. The drug targets of the visualized experiments are also grouped. Drugs in groups I and IV target kinases within the growth-factor-activated pathways, whereas drugs in groups II and III mainly target CDKs.
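The filtering step in this caption (norm of the signature, z-score against a null distribution, threshold of 5) can be sketched as follows. This is only a minimal illustration: the caption does not say how the null distribution was generated, so `null_strengths` is simply assumed to be a list of norms from control-like experiments, and all function names and toy values are hypothetical.

```python
import math
import statistics

def signature_strength(signature):
    """Euclidean norm of a gene expression signature vector."""
    return math.sqrt(sum(x * x for x in signature))

def strength_z_score(signature, null_strengths):
    """z-score of a signature's norm relative to a null distribution of norms.

    Assumption: the null distribution is summarized by its sample mean and
    sample standard deviation; the caption does not specify this.
    """
    mu = statistics.mean(null_strengths)
    sigma = statistics.stdev(null_strengths)
    return (signature_strength(signature) - mu) / sigma

def significant_experiments(signatures, null_strengths, threshold=5.0):
    """Keep only experiments whose strength z-score exceeds the threshold."""
    return {
        name: sig
        for name, sig in signatures.items()
        if strength_z_score(sig, null_strengths) > threshold
    }

# Toy data: 2-dimensional signatures instead of 978-dimensional ones.
null_strengths = [1.0, 1.2, 0.8, 1.1, 0.9]
signatures = {
    "drugA_10uM_24h": [3.0, 4.0],  # norm 5.0, far above the null
    "drugB_0.08uM_6h": [0.6, 0.8],  # norm about 1.0, inside the null
}
kept = significant_experiments(signatures, null_strengths)
```

With real data, the retained signatures would then be passed to a PCA routine for the three-dimensional plot.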

**Fig. 3**
Resources from systems biology and systems pharmacology can be integrated by first identifying the various objects, their relations, and their data types, and then converting the data into single entity weighted networks, fuzzy set libraries, or weighted multi-partite graphs.
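As one hedged illustration of such a conversion, the sketch below turns a gene-set library (terms mapped to sets of genes) into a weighted single-node-type gene-gene network by counting how many sets each pair of genes shares. This is plain co-occurrence counting chosen for brevity, not the method used by the authors; the function name and toy library are hypothetical.

```python
from collections import Counter
from itertools import combinations

def library_to_network(library):
    """Convert a gene-set library into a weighted gene-gene network.

    library: dict mapping a term (e.g. a drug or phenotype) to a set of genes.
    Returns a Counter mapping (gene_u, gene_v) pairs, sorted alphabetically
    within the pair, to the number of gene sets that contain both genes.
    """
    edges = Counter()
    for members in library.values():
        for u, v in combinations(sorted(members), 2):
            edges[(u, v)] += 1
    return edges

# Toy library with two terms.
library = {
    "termA": {"TP53", "MDM2", "CDKN1A"},
    "termB": {"TP53", "MDM2"},
}
network = library_to_network(library)
```

The edge weights here are raw counts; a real pipeline would likely normalize them or assess them against a null model.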

**Fig. 4**
Gene-set overlap was computed between a gene-set library made of up-regulated genes in individual tumors from breast cancer patients from TCGA and a gene-set library created by identifying the putative transcription factor target genes from ChIP-seq experiments conducted by ENCODE. Clusters of gene sets that significantly overlap (brown spots) are labeled based on the most common transcription factor(s) identified within the cluster. To create the breast cancer patient gene-set library, Affymetrix microarray gene expression data from 536 breast cancer patients were downloaded from TCGA. Average expression was computed for genes with multiple probes. For each gene, a z-score was computed by taking the average and standard deviation across all patients. Genes that are highly and significantly expressed in each tumor were retained for constructing the gene-set library, which contains the 536 patients as the labels and the genes that passed the threshold (p<0.01) as the gene sets associated with each patient. To create the ENCODE gene-set library, 920 experiments applied to 44 cell lines profiling 160 transcription factors were processed. We retained the target genes that had significant peaks within +2k bp of their transcription start site (TSS). Since most experiments have replicates, we only kept genes identified in both replicates. The ENCODE gene-set library contains 434 unique experiments. Gene-set overlap was computed using the Fisher exact test. The hierarchical clustering plot was created with MATLAB using the Bioinformatics Toolbox.
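The overlap test between one patient gene set and one ENCODE gene set can be sketched as below. The caption names the Fisher exact test but does not state the sidedness or the background size, so this sketch assumes a one-sided test for over-representation (the hypergeometric upper tail) and a caller-supplied `background_size`; the function name and toy gene names are hypothetical.

```python
from math import comb

def fisher_overlap_p(set_a, set_b, background_size):
    """One-sided Fisher exact p-value for the overlap of two gene sets.

    Computes P(X >= k), where X is hypergeometric: drawing len(set_b) genes
    from a background of `background_size` genes, of which len(set_a) are
    "marked", and k is the observed overlap.
    """
    a, b = set(set_a), set(set_b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    total = comb(background_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(background_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total

# Toy example with a background of 10 genes.
tumor_up = {"g1", "g2", "g3"}
tf_targets = {"g1", "g2", "g3"}
p_full_overlap = fisher_overlap_p(tumor_up, tf_targets, background_size=10)
p_no_overlap = fisher_overlap_p({"g1", "g2"}, {"g8", "g9"}, background_size=10)
```

Applying this function to every patient set against every ENCODE set yields the matrix of p-values that a hierarchical clustering step could then organize.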

**Fig. 5**
Tri-partite network that integrates gene expression data from cancer cell-lines and patient tumors with drug response data for cancer cell lines. The network connects groups of patients, cell-lines, and drugs to suggest drugs for patients. Edges between patient groups and cell-lines are colored based on higher (red) or lower (green) expression correlation. Edges between cell-lines and drugs are colored based on higher (magenta/purple) to lower (cyan) drug sensitivity.

**Fig. 6**
Genes were ranked according to the significance of their differential expression after transcription factor perturbation, and the ranks were scaled such that the most significant gene received a scaled rank of r=0 and the least significant gene received r=1. The scaled ranks of all genes associated with binding sites of the perturbed transcription factor in ENCODE experiments were identified. This process was repeated over 73 experiments in which transcription factor perturbations were followed by gene expression profiling, and the cumulative distribution D(r) was calculated. After subtracting the cumulative distribution expected for a random uniform distribution, which corresponds to the null hypothesis of no enrichment of the genes associated with the perturbed transcription factor in the ENCODE data, we plot the cumulative distributions for each of the five differential expression approaches. To indicate the significance of the deviations from a uniform random distribution, we also indicate the values scaled by the expected standard deviation ψ. Note that the greater the peak of the resulting curve, the higher the priority the method assigns to genes associated with the perturbed transcription factor in independent ENCODE experiments.
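The benchmarking curve described here can be sketched for a single experiment as follows: scale the ranks to [0, 1], collect the scaled ranks of the known target genes, and evaluate D(r) − r on a grid, since the cumulative distribution of a uniform variable on [0, 1] is r itself. Pooling over the 73 experiments and the scaling by ψ are omitted, and the function names and grid size are assumptions made for illustration.

```python
def scaled_ranks(genes_by_significance):
    """Map each gene to a scaled rank: 0 for the most significant, 1 for the least.

    genes_by_significance: list of gene names, most significant first
    (must contain at least two genes).
    """
    n = len(genes_by_significance)
    return {gene: i / (n - 1) for i, gene in enumerate(genes_by_significance)}

def cumulative_deviation(target_ranks, grid_size=101):
    """Return (r, D(r) - r) pairs on an evenly spaced grid over [0, 1].

    D(r) is the fraction of target genes with scaled rank <= r; subtracting r
    removes the cumulative distribution expected under uniform random ranks.
    """
    m = len(target_ranks)
    curve = []
    for j in range(grid_size):
        r = j / (grid_size - 1)
        d = sum(1 for x in target_ranks if x <= r) / m
        curve.append((r, d - r))
    return curve

# Toy example: five ranked genes, two of which are known targets.
ranked = ["g1", "g2", "g3", "g4", "g5"]
ranks = scaled_ranks(ranked)
targets = ["g1", "g2"]
curve = cumulative_deviation([ranks[g] for g in targets], grid_size=5)
peak = max(dev for _, dev in curve)
```

A method that places true targets near the top of its ranking produces a curve with a high positive peak, which is the comparison the caption describes.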

**Fig. 7**
Two drug-drug similarity networks, one based on chemical structure and the other based on gene expression signatures, are used to predict a drug-drug similarity network that is based on shared side effects. The R library ChemmineR was used to create the drug-drug similarity network based on structure (38). The simplified molecular-input line-entry system (SMILES) strings of 1,409 FDA-approved drugs were converted to binary strings representing 166 Molecular ACCess System (MACCS) structural elements of the drugs. Then, to create a drug-drug similarity network, the Jaccard index was used to measure the overlap between shared structural elements for each pair of drugs. Gene expression data from the LINCS L1000 project were obtained from the CPC and CPD batches by selecting, for each compound treatment, the most significant perturbation among all dosages, time points, and cell types, based on the signature strength as defined by the documentation on the lincscloud.org website (the 'distil_ss' value). To create a drug-drug similarity network, Pearson's correlation coefficient was computed between all pairs of drugs using the expression values of the 978 landmark genes. The side-effect drug-drug similarity network was created from SIDER by first creating a gene-set library where the drugs are the terms and the side effects are the set elements. The Sets2Networks algorithm (30) was used to compute similarity between drugs. The interactions between drugs were sorted and ROC curves were plotted based on matched interactions. To combine the L1000 and MACCS scores, the normalized scores were simply added.
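Two of the steps in this caption, the Jaccard index on binary fingerprints and the addition of normalized scores, can be sketched as below. The caption does not say which normalization was used, so min-max scaling is assumed here purely for illustration; fingerprints are 4 bits rather than 166, and all names and values are hypothetical.

```python
def jaccard(bits_a, bits_b):
    """Jaccard index between two binary fingerprint strings such as '1100'."""
    on_a = {i for i, c in enumerate(bits_a) if c == "1"}
    on_b = {i for i, c in enumerate(bits_b) if c == "1"}
    union = on_a | on_b
    return len(on_a & on_b) / len(union) if union else 0.0

def min_max_normalize(scores):
    """Rescale a dict of scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}

def combine(structure_scores, expression_scores):
    """Add normalized structure and expression similarities per drug pair."""
    s = min_max_normalize(structure_scores)
    e = min_max_normalize(expression_scores)
    return {pair: s[pair] + e[pair] for pair in s.keys() & e.keys()}

# Toy example with three drugs and made-up similarity values.
similarity_ab = jaccard("1100", "1010")
structure = {("a", "b"): 0.2, ("a", "c"): 0.8}
expression = {("a", "b"): -0.5, ("a", "c"): 0.5}
combined = combine(structure, expression)
```

Sorting drug pairs by the combined score and comparing against pairs that share side effects would give the input to the ROC analysis.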

References

1. Jenkins SL, Ma’ayan A. Systems pharmacology meets predictive, preventive, personalized and participatory medicine. Pharmacogenomics. 2013;14:119–122.
2. Berger SI, Iyengar R. Network analyses in systems pharmacology. Bioinformatics. 2009;25:2466–2472.
3. Mayer-Schönberger V, Cukier K. Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt; 2013.
4. Chibon F. Cancer gene expression signatures–The rise and fall? European Journal of Cancer. 2013;49:2000–2009.
5. Stegmaier K, Ross KN, Colavito SA, O’Malley S, Stockwell BR, Golub TR. Gene expression–based high-throughput screening (GE-HTS) and application to leukemia differentiation. Nature Genetics. 2004;36:257–263.
