PLoS Comput Biol. 2021 Sep 7;17(9):e1009105.
doi: 10.1371/journal.pcbi.1009105. eCollection 2021 Sep.

Pathway analysis in metabolomics: Recommendations for the use of over-representation analysis

Cecilia Wieder et al. PLoS Comput Biol. 2021.

Abstract

Over-representation analysis (ORA) is one of the most common pathway analysis approaches used for the functional interpretation of metabolomics datasets. Despite the widespread use of ORA in metabolomics, the community lacks guidelines detailing its best-practice use. Many factors have a pronounced impact on the results, but to date their effects have received little systematic attention. Using five publicly available datasets, we demonstrated that changes in parameters such as the background set, differential metabolite selection methods, and pathway database used can result in profoundly different ORA results. The use of a non-assay-specific background set, for example, resulted in large numbers of false-positive pathways. Pathway database choice, evaluated using three of the most popular metabolic pathway databases (KEGG, Reactome, and BioCyc), led to vastly different results in both the number and function of significantly enriched pathways. Factors that are specific to metabolomics data, such as the reliability of compound identification and the chemical bias of different analytical platforms, also impacted ORA results. Simulated metabolite misidentification rates as low as 4% resulted in both gain of false-positive pathways and loss of truly significant pathways across all datasets. Our results have several practical implications for ORA users, as well as those using alternative pathway analysis methods. We offer a set of recommendations for the use of ORA in metabolomics, alongside a set of minimal reporting guidelines, as a first step towards the standardisation of pathway analysis in metabolomics.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Over Representation Analysis (ORA).
Venn diagram representing ORA parameters corresponding to Eq 1. N represents compounds forming the background set, which covers part of the full metabolome. M represents compounds in the pathway of interest. n represents compounds of interest (i.e., differentially abundant metabolites), and k represents the overlap between the list of compounds of interest and compounds in the pathway.
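
The parameters in this caption can be turned into a p-value with a right-tailed hypergeometric test. Eq 1 itself is not reproduced on this page, so the sketch below assumes the standard form used for ORA (equivalent to a one-sided Fisher's exact test); the function name `ora_p_value` is hypothetical, while N, M, n, and k follow the caption.

```python
from math import comb


def ora_p_value(N: int, M: int, n: int, k: int) -> float:
    """Right-tailed hypergeometric probability P(X >= k).

    N: compounds in the background set
    M: background compounds that belong to the pathway of interest
    n: compounds of interest (differentially abundant metabolites)
    k: compounds of interest that fall in the pathway
    Assumption: this is the standard ORA formulation, not a copy of Eq 1.
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= k <= min(M, n)):
        raise ValueError("inconsistent ORA counts")
    total = comb(N, n)  # all ways to draw n compounds from the background
    upper = min(M, n)   # the overlap cannot exceed either set
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total


# Illustrative counts only (not taken from the paper's datasets).
p_assay_specific = ora_p_value(N=100, M=10, n=20, k=5)
p_inflated_background = ora_p_value(N=1000, M=10, n=20, k=5)
```

With the other counts held fixed, enlarging N lowers the p-value, which is consistent with the abstract's observation that a non-assay-specific background set produces false-positive pathways. In practice M would also change with the background, so this comparison is only a simplified illustration.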
Fig 2. Effect of background set.
A) Scatter plot of -log10 p-values of pathways when using an assay-specific background set consisting of all measurable compounds in each dataset (x-axis) compared to using a non-specific background set containing all compounds mapping to at least one KEGG pathway (y-axis). Dashed black lines represent a p-value threshold of p = 0.1. Regression lines are shown with shading representing the 95% confidence interval. B) Number of pathways significant at p ≤ 0.1 (solid bars) and the number of pathways significant at q < 0.1 (hashed bars, BH FDR correction). Datasets are ordered by number of compounds mapping to KEGG pathways. C and D) The effect of reducing the size of the background set. C) Compounds were removed from the background set at random and DA metabolites were identified based on the modified background set. D) Only non-DA compounds were removed from the background set at random. In panels A, C, and D, dashed lines represent datasets where no chromatography/electrophoresis was used. Error bars represent standard error of the mean.
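
Panel B compares raw p-values with q-values from Benjamini-Hochberg (BH) FDR correction. As a minimal sketch of how such q-values are usually computed (the function name `benjamini_hochberg` and the example p-values are hypothetical, and this is not the authors' code):

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical pathway p-values, for illustration only.
pathway_p = [0.01, 0.04, 0.03, 0.5]
pathway_q = benjamini_hochberg(pathway_p)
significant = [q < 0.1 for q in pathway_q]
```

Because each q-value is at least as large as its p-value, fewer pathways typically pass q < 0.1 than p ≤ 0.1, which is the contrast between the hashed and solid bars.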
Fig 3. Number of DA metabolites.
The effect of the number of DA metabolites in the list of metabolites of interest on the number of significant pathways (p ≤ 0.1) in the Labbé et al. dataset. Results corresponding to Bonferroni thresholds are denoted by red markers while those corresponding to BH FDR thresholds are denoted by black markers. Marker shape (circle, cross, or triangle) represents the adjusted p-value threshold for DA metabolite selection (0.005, 0.05, and 0.1 respectively).
Fig 4. Comparison of pathway databases and database updates.
A) Pathway size distribution of KEGG, Reactome, and HumanCyc databases. Violin plots show the distribution of pathway size (number of compounds, log10 transformed). Bold vertical lines show median, dashed vertical lines show lower and upper quartiles. B) Comparison of Reactome human pathway set (R-HSA) releases spanning the years 2017 (R61, June 2017) to 2020 (R75, December 2020). Data for release 67 was not available. Dot colour corresponds to release version, with lighter colours representing newer releases.
Fig 5. Metabolite misidentification.
The effect of compound misidentification by molecular weight (20 ppm window; bars in dark colours) and by chemical formula (bars in light colours) on the mean pathway loss rate (lower bars) and mean pathway gain rate (upper bars), averaged over 100 random resamplings at 4% misidentification. Error bars represent standard error of the mean.
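
To make the 20 ppm window concrete, the sketch below checks whether two masses are close enough to be confused under a parts-per-million tolerance. The function name `within_ppm`, the choice of the first mass as the reference, and the example masses are assumptions for illustration; the caption does not specify how the authors computed the window.

```python
def within_ppm(reference_mass: float, candidate_mass: float, ppm: float = 20.0) -> bool:
    """True if candidate_mass lies within +/- ppm of reference_mass.

    Assumption: the deviation is measured relative to reference_mass.
    """
    if reference_mass <= 0:
        raise ValueError("reference_mass must be positive")
    deviation_ppm = abs(candidate_mass - reference_mass) / reference_mass * 1e6
    return deviation_ppm <= ppm


# Hypothetical masses: the first pair differs by roughly 9 ppm, the second by roughly 37 ppm.
close_pair = within_ppm(180.0634, 180.0650)
distant_pair = within_ppm(180.0634, 180.0700)
```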
Fig 6. The effect of assay chemical specificity on pathways accessible in the KEGG metabolic network.
Both panels A and B are based on the four assay types present in the Stevens et al. dataset. The colours in each panel correspond to the four assay types shown in the legend. A) KEGG reference metabolic network with compounds from each assay type highlighted on their respective pathways. KEGG network annotated using iPath 3 [22]. B) Venn diagram showing the number of KEGG pathways accessible using the compounds in each of the four assay types. Numbers outside the Venn diagram indicate the total number of pathways accessible with each assay type. Venn diagram created using InteractiVenn [23].

References

    1. Nguyen TM, Shafi A, Nguyen T, Draghici S. Identifying significantly impacted pathways: A comprehensive review and assessment. Genome Biol. 2019;20. doi: 10.1186/s13059-019-1790-4
    2. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: Current approaches and outstanding challenges. Ouzounis CA, editor. PLoS Computational Biology. Public Library of Science; 2012. p. e1002375. doi: 10.1371/journal.pcbi.1002375
    3. Karnovsky A, Li S. Pathway Analysis for Targeted and Untargeted Metabolomics. Methods in Molecular Biology. Humana Press Inc.; 2020. pp. 387–400. doi: 10.1007/978-1-0716-0239-3_19
    4. Marco-Ramell A, Palau-Rodriguez M, Alay A, Tulipani S, Urpi-Sarda M, Sanchez-Pla A, et al. Evaluation and comparison of bioinformatic tools for the enrichment analysis of metabolomics data. BMC Bioinformatics. 2018;19: 1. doi: 10.1186/s12859-017-2006-0
    5. García-Campos MA, Espinal-Enríquez J, Hernández-Lemus E. Pathway analysis: State of the art. Frontiers in Physiology. Frontiers Research Foundation; 2015. doi: 10.3389/fphys.2015.00383