BMC Genomics. 2016 Nov 4;17(1):874.

doi: 10.1186/s12864-016-3198-9.

Mergeomics: multidimensional data integration to identify pathogenic perturbations to biological systems

Le Shu¹, Yuqi Zhao¹, Zeyneb Kurt¹, Sean Geoffrey Byars²,³, Taru Tukiainen⁴, Johannes Kettunen⁴, Luz D Orozco⁵, Matteo Pellegrini⁵, Aldons J Lusis⁶, Samuli Ripatti⁴, Bin Zhang⁷, Michael Inouye²,³,⁸, Ville-Petteri Mäkinen⁹,¹⁰,¹¹,¹², Xia Yang¹³,¹⁴

Affiliations

¹ Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, USA.
² Center for Systems Genomics, University of Melbourne, Melbourne, Australia.
³ School of BioSciences, University of Melbourne, Melbourne, Australia.
⁴ Institute for Molecular Medicine, Helsinki, Finland.
⁵ Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, Los Angeles, CA, USA.
⁶ Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
⁷ Department of Genetics and Genomics Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
⁸ Department of Pathology, University of Melbourne, Melbourne, Australia.
⁹ Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, USA. ville-petteri.makinen@sahmri.com.
¹⁰ South Australian Health and Medical Research Institute, Adelaide, Australia. ville-petteri.makinen@sahmri.com.
¹¹ School of Biological Sciences, University of Adelaide, Adelaide, Australia. ville-petteri.makinen@sahmri.com.
¹² Computational Medicine, Faculty of Medicine, University of Oulu and Biocenter Oulu, Oulu, Finland. ville-petteri.makinen@sahmri.com.
¹³ Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, USA. xyang123@ucla.edu.
¹⁴ Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, Los Angeles, CA, USA. xyang123@ucla.edu.

PMID: 27814671
PMCID: PMC5097440
DOI: 10.1186/s12864-016-3198-9

Le Shu et al. BMC Genomics. 2016.

Abstract

Background: Complex diseases are characterized by multiple subtle perturbations to biological processes. New omics platforms can detect these perturbations, but translating the diverse molecular and statistical information into testable mechanistic hypotheses is challenging. Therefore, we set out to create a public tool that integrates these data across multiple datasets, platforms, study designs and species in order to detect the most promising targets for further mechanistic studies.

Results: We developed Mergeomics, a computational pipeline consisting of independent modules that 1) leverage multi-omics association data to identify biological processes that are perturbed in disease, and 2) overlay the disease-associated processes onto molecular interaction networks to pinpoint hubs as potential key regulators. Unlike existing tools, which are mostly dedicated to specific data types or settings, the Mergeomics pipeline accepts and integrates datasets across platforms, data types and species. We optimized and evaluated the performance of Mergeomics using simulation and multiple independent datasets, and benchmarked the results against alternative methods. We also demonstrate the versatility of Mergeomics in two case studies that include genome-wide, epigenome-wide and transcriptome-wide datasets from human and mouse studies of total cholesterol and fasting glucose. In both cases, the Mergeomics pipeline provided statistical and contextual evidence to prioritize further investigations in the wet lab. The software implementation of Mergeomics is freely available as a Bioconductor R package.

Conclusion: Mergeomics is a flexible and robust computational pipeline for multidimensional data integration. It outperforms existing tools, and is easily applicable to datasets from different studies, species and omics data types for the study of complex traits.

Keywords: Blood glucose; Cholesterol; Functional genomics; Gene networks; Integrative genomics; Key drivers; Mergeomics; Multidimensional data integration.

Figures

**Fig. 1**
Main modules, data flow between them and examples of data types that can be integrated by Mergeomics

**Fig. 2**
Schematic illustration of the concept of a key driver gene (a) and local hubs with overlapping neighborhoods (b)

**Fig. 3**
Comparison of three pathway enrichment methods across three GWAS. Performance is evaluated by sensitivity (a), specificity (b), positive likelihood ratio (sensitivity/(1-specificity)) (c) and receiver operating characteristic curve (d–f). Sensitivity was defined as the proportion of positive control pathways detected at FDR < 25 %. Specificity was defined as the proportion of negative controls rejected at FDR ≥ 25 %. Error bars denote the standard error of simulation results
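
The three summary metrics named in this caption can be illustrated with a minimal Python sketch. It is not taken from the Mergeomics package: the function name `enrichment_metrics` and the FDR values are hypothetical, and the only things drawn from the caption are the 25 % FDR cutoff and the definitions of sensitivity, specificity and positive likelihood ratio.

```python
def enrichment_metrics(pos_fdr, neg_fdr, cutoff=0.25):
    """Summarize pathway detection performance from control pathways.

    pos_fdr: FDR values of positive control pathways (should be detected).
    neg_fdr: FDR values of negative control pathways (should not be detected).
    cutoff:  FDR threshold; 0.25 mirrors the 25 % cutoff in the caption.
    """
    # Sensitivity: proportion of positive controls detected at FDR < cutoff.
    sensitivity = sum(f < cutoff for f in pos_fdr) / len(pos_fdr)
    # Specificity: proportion of negative controls rejected at FDR >= cutoff.
    specificity = sum(f >= cutoff for f in neg_fdr) / len(neg_fdr)
    # Positive likelihood ratio: sensitivity / (1 - specificity).
    false_positive_rate = 1.0 - specificity
    if false_positive_rate > 0:
        plr = sensitivity / false_positive_rate
    else:
        plr = float("inf")  # no false positives at this cutoff
    return sensitivity, specificity, plr


# Hypothetical FDR values, for illustration only.
pos_fdr = [0.01, 0.10, 0.30, 0.20]
neg_fdr = [0.50, 0.90, 0.20, 0.70, 0.60]
sens, spec, plr = enrichment_metrics(pos_fdr, neg_fdr)
print(sens, spec, plr)
```

With these made-up inputs, three of four positive controls pass the cutoff and four of five negative controls are rejected, so the ratio is roughly 0.75 / 0.2.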

**Fig. 4**
Comparison of performance of SNP-level meta-analysis and pathway-level meta-analysis using simulated gene-sets. Results are produced in the same workflow as stated in Table 1. a Sensitivity. b Specificity. c Positive likelihood ratio (Sensitivity/(1-Specificity)). d Receiver operating characteristic curve. Error bars denote the standard error of simulation results

**Fig. 5**
Performance comparison between wKDA and the unweighted key driver analysis. Two empirical subnetworks (Lipid I & II) were obtained from a previous publication [23], and a canonical metabolism of lipids and lipoproteins pathway was obtained from the Reactome database (R-HSA-556833). The methods were tested by projecting the three functional subnetworks onto two independent adipose networks (a–c) and two independent liver regulatory networks (d–f). The adipose and liver networks were constructed from a collection of Bayesian tissue-specific network models (Additional file 1: Table S3). Overlap between the tissue-specific key driver signals across two independent regulatory networks was defined according to the Jaccard index. Overlap ratio was calculated for both original networks and networks with 25, 50, 75 or 100 % rewiring of edges
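
The Jaccard index used here to quantify overlap between key driver sets is the size of the intersection divided by the size of the union. A minimal sketch follows, assuming each network yields a plain set of key driver identifiers; the function name and the placeholder identifiers are hypothetical and are not results from the paper.

```python
def jaccard_index(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of identifiers."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Placeholder key driver sets from two independent networks (made up).
key_drivers_network_1 = {"geneA", "geneB", "geneC", "geneD"}
key_drivers_network_2 = {"geneB", "geneD", "geneE"}
overlap = jaccard_index(key_drivers_network_1, key_drivers_network_2)
print(overlap)
```

In this toy case two identifiers are shared out of five distinct ones. A value near 1 would mean the two networks nominate nearly the same key drivers, and a value near 0 would mean they nominate almost disjoint sets.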

**Fig. 6**
Visualization of adipose (a) and liver (b) networks around top key drivers that were identified for cholesterol-associated subnetworks. Top key drivers (nodes with the largest size) are selected as the top five independent key regulatory genes (genes whose neighbourhood has less than 25 % overlap with the neighbourhood of other independent hubs) for subnetwork 2 and subnetwork 6. Subnetwork member genes are denoted as medium size nodes and non-member genes as small size nodes. Top co-hubs (co-hubs with FDR < 10⁻¹⁰ in wKDA) are highlighted by yellow circles. Only edges that were supported by at least two studies are drawn
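
One way to read the selection rule in this caption is as a greedy filter over hubs ranked by significance. The sketch below is an interpretation, not the published procedure: the caption does not state how overlap is normalized, so the code assumes overlap is the fraction of a candidate's neighbourhood shared with an already selected hub. The function name, the hub labels and the neighbourhoods are all hypothetical.

```python
def select_independent_hubs(ranked_hubs, neighbors, max_overlap=0.25, top_n=5):
    """Greedily keep hubs whose neighbourhood overlaps little with kept hubs.

    ranked_hubs: hub identifiers, most significant first (assumed input).
    neighbors:   dict mapping each hub to the set of its network neighbours.
    max_overlap: a candidate is kept only if its overlap with every kept hub
                 is strictly below this fraction (0.25 mirrors the caption).
    top_n:       stop once this many hubs are kept (5 mirrors the caption).
    """
    selected = []
    for hub in ranked_hubs:
        hub_nbrs = neighbors[hub]
        is_independent = True
        for kept in selected:
            shared = len(hub_nbrs & neighbors[kept])
            # Assumed normalization: share of the candidate's own neighbourhood.
            fraction = shared / len(hub_nbrs) if hub_nbrs else 0.0
            if fraction >= max_overlap:
                is_independent = False
                break
        if is_independent:
            selected.append(hub)
        if len(selected) == top_n:
            break
    return selected


# Made-up neighbourhoods for four candidate hubs, ranked h1 (best) to h4.
neighbors = {
    "h1": {1, 2, 3, 4},
    "h2": {1, 2, 3, 5},     # shares 3 of 4 neighbours with h1
    "h3": {1, 6, 7, 8, 9},  # shares 1 of 5 neighbours with h1
    "h4": {10, 11},         # shares nothing with h1 or h3
}
top_hubs = select_independent_hubs(["h1", "h2", "h3", "h4"], neighbors, top_n=3)
print(top_hubs)
```

Here `h2` is dropped because most of its neighbourhood is already covered by `h1`, while `h3` and `h4` are kept. A different normalization, for example dividing by the union of the two neighbourhoods, would change which hubs pass the 25 % threshold.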

References

1. Hunter DJ. Gene-environment interactions in human diseases. Nat Rev Genet. 2005;6(4):287–298. doi: 10.1038/nrg1578.
2. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007;39(10):1181–1186. doi: 10.1038/ng1007-1181.
3. Barrett T, Edgar R. Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods Enzymol. 2006;411:352–369. doi: 10.1016/S0076-6879(06)11019-8.
4. Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A, Holloway E, Kolesnykov N, Lilja P, Lukk M, et al. ArrayExpress--a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007;35(Database issue):D747–D750. doi: 10.1093/nar/gkl995.
5. Consortium EP, Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799–816. doi: 10.1038/nature05874.

Grants and funding

R01 DK104363/DK/NIDDK NIH HHS/United States
