State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing

Michal Krassowski¹, Vivek Das², Sangram K Sahu³, Biswapriya B Misra⁴

Affiliations

¹ Nuffield Department of Women's & Reproductive Health, University of Oxford, Oxford, United Kingdom.
² Novo Nordisk Research Center Seattle, Inc, Seattle, WA, United States.
³ Independent Researcher, Bengaluru, India.
⁴ Independent Researcher, Namburu, India.

PMID: 33362867
PMCID: PMC7758509
DOI: 10.3389/fgene.2020.610798

Review

Michal Krassowski et al. Front Genet. 2020 Dec 10:11:610798. doi: 10.3389/fgene.2020.610798. eCollection 2020.

Abstract

Multi-omics, variously called integrated omics, pan-omics, and trans-omics, aims to combine two or more omics data sets to aid in data analysis, visualization and interpretation to determine the mechanism of a biological process. Multi-omics efforts have taken center stage in biomedical research, leading to new insights into biological events and processes. However, the proliferation of tools, datasets, and approaches tends to inundate the literature and overwhelm researchers new to the field. The aims of this review are to provide an overview of the current state of the field, inform on available reliable resources, discuss the application of statistics and machine/deep learning in multi-omics analyses, discuss findable, accessible, interoperable, reusable (FAIR) research, and point to best practices in benchmarking. Thus, we provide guidance to interested users of the domain by addressing challenges of the underlying biology, giving an overview of the available toolset, addressing common pitfalls, and acknowledging current methods' limitations. We conclude with practical advice and recommendations on software engineering and reproducibility practices, to give researchers new to multi-omics a comprehensive view of the end-to-end workflow.

Keywords: FAIR; benchmarking; data heterogeneity; integrated omics; machine learning; multi-omics; reproducibility; visualization.

Conflict of interest statement

VD currently works as a Post-Doctoral Researcher in Novo Nordisk Research Center Seattle, Inc. He did not receive any funding for this work. BBM works as a Computational Biologist in Enveda Therapeutics and did not receive any funding for this work. SS has no conflicts of interest. MK has no financial conflicts of interest, but he contributed to two of the discussed projects: rpy2 and Jupyter.

Figures

**FIGURE 1**
The complexity of multi-omics: merger of omics-driven biology, data science, informatics, statistics, and computational sciences.

**FIGURE 2**
Example of the complexity and interconnectivity of omics data sources in a multi-omics framework. A simple cellular endogenous metabolite, lactate, is biosynthesized enzymatically from pyruvate (another metabolite) with the help of lactate dehydrogenase (LDHA, a catalytic protein). In turn, LDHA can interact with several known and unknown proteins through protein-protein interactions to regulate its own function, and is itself subject to diverse post-translational modifications (PTMs) that regulate its catalytic function. Lactate measurement through techniques such as *in vivo* brain imaging in humans or model animals can map lactate’s spatial distribution. The gut microbiome, via Lactobacillus and other microbes, can synthesize lactate and release it into human physiological systems, contributing to lactate levels. Lactate biosynthesis can be regulated by genetic (e.g., SNPs, CNV, etc.), transcriptomic, post-transcriptomic (e.g., miRNA) and/or epigenetic (e.g., DNA methylation) changes in the LDHA gene. Although this is one of the better-studied sets of multi-omics interactions, one can expect more complex and unknown interactions when integrating multi-omics datasets.

**FIGURE 3**
Flow diagram of best-practice guidelines for FAIR sharing in a multi-omics study. A multi-omics study entails data from varied assays, sources, and omics types, which can be integrated using various frameworks and tools. This process (the light-green block) can be computationally intensive. Its by-product is processed data, which can be taken forward through multiple steps of exploration, inference, and interpretation. Sharing both the data and the code alongside the compute environment allows interoperability and avoids reinventing the wheel. The third block (light purple) describes the open sharing of the different components of a multi-omics project; the connected blocks can eventually generate reproducible results in the form of reports for users.

**FIGURE 4**
A systematic flow diagram of the screen of multi-omics literature among PubMed-indexed articles (up to July 2020). This flow diagram represents the inclusion and exclusion steps used to identify characteristics and attributes of published multi-omics studies. A detailed, self-explanatory method with reproducible code is available at https://github.com/krassowski/multi-omics-state-of-the-field.

**FIGURE 5**
Characterization of multi-omics literature based on a systematic screen of PubMed-indexed articles (up to July 2020). **(A)** Combinations of omics (grouped by the characterized entities) commonly discussed together in multi-omics articles (intersections with ≥ 3 omics and at least 50 papers). *The proteins group (1) also includes peptides; the metabolites group (2) includes other endogenous molecules; the epigenetic group (3) encompasses all epigenetic modifications.* **(B)** Trend plot representing the rapidly increasing number of multi-omics articles indexed in PubMed (also after adjusting for the number of articles published in matched journals – data not shown); the dip in 2020 can be attributed to indexing delay, which was not accounted for in the current plot. **(C)** Distribution of article categories that mention different numbers of omics; while it is understandable that the multi-omics “Review” category discusses many omics, the “Computational method” category articles appear to lag behind all other article categories. The detected number of omics may underestimate the actual number (due to the automated search strategy) but should put a useful lower bound on the number of omics discussed. Bootstrapped 95% confidence intervals around the mean are presented with the whiskers. **(D)** The number of articles mentioning the most popular clinical findings, disease terms (here screening is based on the ClinVar diseases list) and species (based upon the NCBI Taxonomy database). Both databases were manually filtered to remove ambiguous terms and merge plural/singular forms. Only the abstracts were screened here. **(E)** The detected references to code, data versioning, distribution platforms and systems (links to repositories with deposited code/data); both the abstracts and full texts (open-access subset, 44% of all articles) were screened. No manual curation to classify the intent of the link inclusion (i.e., to share the authors’ code/data vs. to report the use of a dataset/tool) was undertaken. The details of the methods with reproducible code are available at github.com/krassowski/multi-omics-state-of-the-field. The comprehensive search terms (see the online repository for details) were collapsed into four categories; integrated omics (*) includes integromics and integrative omics, multi-view (**) includes multi-view| block| source| modal omics, other terms (***) include pan-, trans-, poly-, cross-omics.
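
The whiskers in panel (C) are described as bootstrapped 95% confidence intervals around the mean. As a rough illustration of that idea only, the sketch below computes a percentile bootstrap interval in plain Python. It is not taken from the authors' repository: the function name `bootstrap_mean_ci`, the number of resamples, the percentile method, and the example counts are all assumptions made for illustration, and the authors' actual procedure may differ.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; not the authors' implementation.
    Returns (sample mean, lower bound, upper bound).
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Take the empirical quantiles that bracket the central `confidence` mass.
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * (n_resamples - 1))]
    upper = means[int((1.0 - alpha) * (n_resamples - 1))]
    return statistics.fmean(values), lower, upper


# Made-up counts of omics detected per article within one article category.
omics_per_article = [2, 3, 2, 4, 2, 5, 3, 2, 3, 4]
mean, low, high = bootstrap_mean_ci(omics_per_article)
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The percentile method is used here because it is the simplest variant to read; other bootstrap intervals (for example bias-corrected ones) would be equally consistent with the caption.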
