This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Jan 9:2024.01.09.574780.

doi: 10.1101/2024.01.09.574780.

PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration

Cecilia Wieder¹, Juliette Cooke², Clement Frainay², Nathalie Poupin², Russell Bowler³, Fabien Jourdan⁴, Katerina J Kechris⁵, Rachel Pj Lai⁶, Timothy Ebbels¹

Affiliations

¹ Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion, and Reproduction, Faculty of Medicine, Imperial College London, London, United Kingdom.
² Toxalim (Research Centre in Food Toxicology), Université de Toulouse, INRAE, ENVT, INP-Purpan, UPS, Toulouse, France.
³ National Jewish Health, 1400 Jackson Street, Denver, CO, 80206, USA.
⁴ MetaboHUB-Metatoul, National Infrastructure of Metabolomics and Fluxomics, Toulouse, France.
⁵ Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America.
⁶ Department of Infectious Disease, Faculty of Medicine, Imperial College London, London, United Kingdom.

PMID: 38260498
PMCID: PMC10802464
DOI: 10.1101/2024.01.09.574780

Cecilia Wieder et al. bioRxiv. 2024.


Update in

PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration.
Wieder C, Cooke J, Frainay C, Poupin N, Bowler R, Jourdan F, Kechris KJ, Lai RP, Ebbels T. PLoS Comput Biol. 2024 Mar 25;20(3):e1011814. doi: 10.1371/journal.pcbi.1011814. eCollection 2024 Mar. PMID: 38527092.

Abstract

As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here we propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies. PathIntegrate employs single-sample pathway analysis to transform multi-omics datasets from the molecular level to the pathway level, and applies a predictive single-view or multi-view model to integrate the data. Model outputs include multi-omics pathways ranked by their contribution to the outcome prediction, the contribution of each omics layer, and the importance of each molecule in a pathway. Using semi-synthetic data, we demonstrate the benefit of grouping molecules into pathways to detect signals in low signal-to-noise scenarios, as well as the ability of PathIntegrate to precisely identify important pathways at low effect sizes. Finally, using COPD and COVID-19 data, we showcase how PathIntegrate enables convenient integration and interpretation of complex high-dimensional multi-omics datasets. The PathIntegrate Python package is available at https://github.com/cwieder/PathIntegrate.
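The pathway transformation described in the abstract can be illustrated with a minimal sketch. It is not the PathIntegrate API: every function name here is hypothetical, and the scoring rule (sum of member z-scores divided by the square root of the pathway size) is a deliberately simple stand-in for the single-sample pathway analysis methods the package actually uses. The `min_coverage=2` default mirrors the coverage threshold mentioned in the Figure 5 caption. The split between the two frameworks (one pathway matrix per omics block for Multi-View, one matrix of multi-omics pathways for Single-View) is an assumption based on the abstract and the Figure 2 caption.

```python
import math
import statistics


def zscore_columns(X):
    """Standardise each column (molecule) of X to zero mean and unit variance."""
    scaled_cols = []
    for col in zip(*X):
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col)
        # Constant columns carry no signal; map them to zero.
        scaled_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def pathway_scores(X, molecules, pathways, min_coverage=2):
    """Transform an N x M molecular matrix into an N x P pathway matrix.

    X: list of N rows (samples), each holding M abundance values.
    molecules: list of M identifiers naming the columns of X.
    pathways: dict mapping pathway name -> set of molecule identifiers.
    Returns (A, names), where A has one column per retained pathway.
    """
    Z = zscore_columns(X)
    column_of = {m: j for j, m in enumerate(molecules)}
    names, score_cols = [], []
    for name, members in pathways.items():
        idx = [column_of[m] for m in members if m in column_of]
        if len(idx) < min_coverage:  # too few measured molecules: skip pathway
            continue
        scale = math.sqrt(len(idx))
        score_cols.append([sum(row[j] for j in idx) / scale for row in Z])
        names.append(name)
    A = [list(row) for row in zip(*score_cols)]
    return A, names


def multi_view(blocks, pathways):
    """Assumed Multi-View input: one pathway matrix per omics block."""
    return {omics: pathway_scores(X, mols, pathways)
            for omics, (X, mols) in blocks.items()}


def single_view(blocks, pathways):
    """Assumed Single-View input: join the molecular blocks column-wise,
    then score each multi-omics pathway once."""
    all_molecules = [m for _, mols in blocks.values() for m in mols]
    joined = [sum((list(r) for r in rows), [])
              for rows in zip(*(X for X, _ in blocks.values()))]
    return pathway_scores(joined, all_molecules, pathways)


# Toy data (invented for illustration): 3 samples, 2 omics blocks.
blocks = {
    "metabolomics": ([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], ["m1", "m2"]),
    "proteomics": ([[5.0], [3.0], [1.0]], ["p1"]),
}
pathways = {"PathwayA": {"m1", "m2", "p1"}, "PathwayB": {"p1"}}

A_single, names_single = single_view(blocks, pathways)
views = multi_view(blocks, pathways)
```

The resulting pathway matrices would then be passed to a predictive model (the figure captions mention PLS, MB-PLS, and random forest), which is outside the scope of this sketch.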


Figures

**Figure 1. Pathway transformation enhances sensitivity to low signal-to-noise signals.**
The y-axis shows the proportion of Mann-Whitney U (MWU) tests significant at Bonferroni-adjusted p ≤ 0.05, performed on either the pathway-level or the molecular-level data, at the varying effect sizes shown on the x-axis. Semi-synthetic data are based on the COVID-19 dataset.

**Figure 2. PathIntegrate Multi-View (left) and Single-View (right) modelling frameworks for multi-omics pathway-based integration.**
Frameworks are outlined in terms of their input data, pathway-transformation stage, statistical model, and outputs. Blue data blocks represent omics data that have been transformed from the molecular ($X_{N \times M}$) space to the pathway ($A_{N \times P}$) space using ssPA. Both Single-View and Multi-View make use of the same multi-omics pathway set.

**Figure 3. Performance of PathIntegrate and DIABLO vs. effect size, based on semi-synthetic data measured by AUROC.**
COPDgene metabolomics and proteomics data were integrated in each model. a. Ability to correctly predict sample outcomes (case vs. control). We compared PathIntegrate Multi-View and Single-View to DIABLO using both molecular and pathway-level multi-omics data. b. Ability to correctly recall the target enriched pathway. We compared DIABLO RGCCA model loadings to the Multi-View MB-PLS VIP and Single-View PLS VIP statistics for pathway importance. c. Comparison of PathIntegrate Multi-View classification performance using the KEGG and Reactome pathway databases, as well as a molecular-level model. d. Effect of sample size on PathIntegrate Multi-View classification performance. For panels a-c, error bars indicate 95% confidence intervals on the mean AUROC (in some cases they appear smaller than the point sizes).

**Figure 4. PathIntegrate Multi-View applied to COPDgene multi-omics data.**
A. Superscores plot based on multi-omics (metabolomics, proteomics, and transcriptomics pathways) across four latent variables. B. Omics view importances across latent variables. Values represent mean and SEM across 100 bootstrap samples. C. Top five pathways per omics block. D. Top 15 pathways across omics blocks categorised by Reactome parent pathway. E. kPCA ssPA scores from top 15 pathways used to cluster samples using Euclidean distance and Ward linkage. F. Heatmap showing Spearman correlation between superscores across four latent variables and clinical metadata. Asterisks indicate Bonferroni p-value ≤ 0.05. Definitions of clinical variables are in Supplementary Table 2.

**Figure 5. Network visualisation with PathIntegrate interactive network explorer.**
PathIntegrate Multi-View was applied to COPDgene multi-omics data. A. Multi-omics network view of the global Reactome hierarchy DAG. Only pathways with sufficient coverage (≥ 2 molecules per pathway) are shown as nodes. Edges represent parent-child relationships between pathways as defined by Reactome. Nodes are coloured by Reactome superpathway membership. Node size corresponds to pathway coverage. B. Network view of the ‘Carnitine metabolism’ pathway (zoomed-in subset of A) and its close neighbourhood within the Reactome pathway hierarchy. Nodes are coloured by p-values obtained from the PathIntegrate Multi-View model.

**Figure 6. PathIntegrate Single-View applied to COVID-19 multi-omics data.**
A. Kernel density distribution of log₁₀ pathway sizes in the COVID dataset per omics view. Pathway size refers to the number of molecules annotated to each pathway present in the COVID datasets. B. Number of pathways with sufficient coverage in the COVID dataset in each omics view. C. Multi-omics pathway features identified using recursive feature elimination from the PathIntegrate Single-View random forest model, ranked by Gini importance. D. Molecular level importances derived from the ‘ADORA2B mediated anti-inflammatory cytokines production’ (R-HSA-9660821) SVD pathway scores. Datapoints represent mean and standard deviation of loadings of each molecule on PC1 across 200 bootstrap samples.
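The metric plotted in Figure 1, the proportion of MWU tests that remain significant after Bonferroni correction, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, and the p-value uses the normal approximation with tie correction and no continuity correction, whereas the source does not say which variant of the test was used.

```python
import math


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical: no evidence of a difference
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def proportion_significant(features, labels, alpha=0.05):
    """Fraction of features with Bonferroni-adjusted MWU p <= alpha.

    features: dict mapping feature name -> list of values, one per sample.
    labels: list of 0 (control) / 1 (case), one per sample.
    """
    m = len(features)
    hits = 0
    for values in features.values():
        cases = [v for v, lab in zip(values, labels) if lab == 1]
        controls = [v for v, lab in zip(values, labels) if lab == 0]
        p_adj = min(1.0, mann_whitney_p(cases, controls) * m)
        if p_adj <= alpha:
            hits += 1
    return hits / m
```

Applying `proportion_significant` once to molecular-level features and once to pathway-level features, at each simulated effect size, would give the two curves that the Figure 1 caption describes.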


