[Preprint]. 2024 Jan 9:2024.01.09.574780.
doi: 10.1101/2024.01.09.574780.

PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration

Cecilia Wieder et al. bioRxiv. 2024.

Abstract

As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here we propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies. PathIntegrate employs single-sample pathway analysis to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data. Model outputs include multi-omics pathways ranked by their contribution to the outcome prediction, the contribution of each omics layer, and the importance of each molecule in a pathway. Using semi-synthetic data we demonstrate the benefit of grouping molecules into pathways to detect signals in low signal-to-noise scenarios, as well as the ability of PathIntegrate to precisely identify important pathways at low effect sizes. Finally, using COPD and COVID-19 data we showcase how PathIntegrate enables convenient integration and interpretation of complex high-dimensional multi-omics datasets. The PathIntegrate Python package is available at https://github.com/cwieder/PathIntegrate.
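The pathway transformation described in the abstract can be illustrated with a minimal sketch. It assumes an SVD-based single-sample pathway score (the first principal component of each pathway's molecules), which is one of several possible ssPA methods. The function name `svd_pathway_scores`, its parameters, and the input layout are hypothetical and are not taken from the PathIntegrate package.

```python
import numpy as np

def svd_pathway_scores(X, molecule_ids, pathways, min_coverage=2):
    """Illustrative SVD-based pathway scoring (not the PathIntegrate API).

    X            : (N samples x M molecules) array of abundances
    molecule_ids : list of M identifiers, one per column of X
    pathways     : dict mapping pathway name -> list of molecule identifiers
    min_coverage : minimum number of measured molecules needed to score a pathway
    """
    column_of = {m: j for j, m in enumerate(molecule_ids)}
    Xc = X - X.mean(axis=0)  # centre each molecule across samples
    scores, loadings = {}, {}
    for name, members in pathways.items():
        idx = [column_of[m] for m in members if m in column_of]
        if len(idx) < min_coverage:
            continue  # skip pathways with too few measured molecules
        U, S, Vt = np.linalg.svd(Xc[:, idx], full_matrices=False)
        scores[name] = U[:, 0] * S[0]  # sample scores on the first component
        loadings[name] = {molecule_ids[j]: w for j, w in zip(idx, Vt[0])}
    return scores, loadings
```

Stacking the returned score vectors column-wise gives a samples-by-pathways matrix that could then be passed to any predictive model; the per-molecule loadings give a rough indication of which molecules drive each pathway score.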

Figures

Figure 1: Pathway transformation enhances sensitivity to low signal-to-noise signals.
The y-axis shows the proportion of Mann-Whitney U (MWU) tests significant at Bonferroni-adjusted p ≤ 0.05, performed on either the pathway-level or the molecular-level data, at the effect sizes shown on the x-axis. The semi-synthetic data are based on the COVID-19 dataset.
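The quantity plotted in Figure 1 can be sketched as follows. This is a standard-library illustration only: it assumes a two-sided MWU test using the normal approximation with tie correction and no continuity correction, and the function names `mwu_pvalue` and `bonferroni_proportion` are hypothetical. The authors' exact test settings are not stated in this caption.

```python
import math

def mwu_pvalue(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted((v, g) for g, vals in ((0, x), (1, y)) for v in vals)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                     # size of this group of tied values
        midrank = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values tied: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni_proportion(pvalues, alpha=0.05):
    """Proportion of tests with Bonferroni-adjusted p-value <= alpha."""
    m = len(pvalues)
    return sum(1 for p in pvalues if min(1.0, p * m) <= alpha) / m
```

Running `mwu_pvalue` once per feature (pathway or molecule) between cases and controls, then passing the resulting p-values to `bonferroni_proportion`, gives one point on a sensitivity curve of this kind.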
Figure 2: PathIntegrate Multi-View (left) and Single-View (right) modelling frameworks for multi-omics pathway-based integration.
Frameworks are outlined in terms of their input data, pathway-transformation stage, statistical model, and outputs. Blue data blocks represent omics data that have been transformed from the molecular space (X, of dimension N × M) to the pathway space (A, of dimension N × P) using single-sample pathway analysis (ssPA). Both Single-View and Multi-View use the same multi-omics pathway set.
Figure 3: Performance of PathIntegrate and DIABLO vs. effect size, based on semi-synthetic data measured by AUROC.
COPDgene metabolomics and proteomics data were integrated in each model. a. Ability to correctly predict sample outcomes (case vs. control). We compared PathIntegrate Multi-View and Single-View to DIABLO using both molecular and pathway-level multi-omics data. b. Ability to correctly recall target enriched pathway. We compared DIABLO RGCCA model loadings to the Multi-View MB-PLS VIP and Single-View PLS VIP statistics for pathway importance. c. Comparison of PathIntegrate Multi-View classification performance using KEGG and Reactome pathway databases as well as molecular-level model. d. Effect of sample size on PathIntegrate Multi-View classification performance. For panels a-c error bars indicate 95% confidence intervals on the mean AUROC (in some cases they appear smaller than point sizes).
Figure 4: PathIntegrate Multi-View applied to COPDgene multi-omics data.
A. Superscores plot based on multi-omics (metabolomics, proteomics, and transcriptomics pathways) across four latent variables. B. Omics view importances across latent variables. Values represent mean and SEM across 100 bootstrap samples. C. Top five pathways per omics block. D. Top 15 pathways across omics blocks categorised by Reactome parent pathway. E. kPCA ssPA scores from top 15 pathways used to cluster samples using Euclidean distance and Ward linkage. F. Heatmap showing Spearman correlation between superscores across four latent variables and clinical metadata. Asterisks indicate Bonferroni p-value ≤ 0.05. Definitions of clinical variables are in Supplementary Table 2.
Figure 5: Network visualisation with PathIntegrate interactive network explorer.
PathIntegrate Multi-View was applied to COPDgene multi-omics data. A. Multi-omics network view of global Reactome hierarchy DAG. Only pathways with sufficient coverage (≥ 2 molecules per pathway) are shown as nodes. Edges represent parent-child relationships between pathways as defined by Reactome. Nodes are coloured by Reactome superpathway membership. Node size corresponds to pathway coverage. B. Network view of the ‘Carnitine metabolism’ pathway (zoomed-in subset of A) and its close neighbourhood within the Reactome pathway hierarchy. Nodes are coloured by p-values obtained from the PathIntegrate Multi-View model.
Figure 6: PathIntegrate Single-View applied to COVID-19 multi-omics data.
A. Kernel density distribution of log10 pathway sizes in the COVID dataset per omics view. Pathway size refers to the number of molecules annotated to each pathway present in the COVID datasets. B. Number of pathways with sufficient coverage in the COVID dataset in each omics view. C. Multi-omics pathway features identified using recursive feature elimination from the PathIntegrate Single-View random forest model, ranked by Gini importance. D. Molecular level importances derived from the ‘ADORA2B mediated anti-inflammatory cytokines production’ (R-HSA-9660821) SVD pathway scores. Datapoints represent mean and standard deviation of loadings of each molecule on PC1 across 200 bootstrap samples.
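The bootstrap summary in panel D (mean and standard deviation of each molecule's PC1 loading across resamples) can be sketched roughly as below. The function names are hypothetical, and aligning the sign of each bootstrap loading vector to the full-data loadings is an assumption made here because the sign of a principal component is arbitrary; the caption does not say how the authors handled this.

```python
import numpy as np

def pc1_loadings(X):
    """Loadings of each column (molecule) on the first principal component."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def bootstrap_pc1_loadings(X, n_boot=200, seed=0):
    """Mean and SD of PC1 loadings over bootstrap resamples of the rows of X."""
    rng = np.random.default_rng(seed)
    reference = pc1_loadings(X)  # full-data loadings, used only to fix the sign
    n_samples, n_molecules = X.shape
    boots = np.empty((n_boot, n_molecules))
    for b in range(n_boot):
        rows = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
        v = pc1_loadings(X[rows])
        if np.dot(v, reference) < 0:
            v = -v  # assumption: flip so every resample points the same way
        boots[b] = v
    return boots.mean(axis=0), boots.std(axis=0, ddof=1)
```

Here `X` would hold only the molecules annotated to a single pathway, so the returned vectors have one entry per molecule in that pathway.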
