PLoS One. 2024 May 16;19(5):e0302696.
doi: 10.1371/journal.pone.0302696. eCollection 2024.

Assessment of Gene Set Enrichment Analysis using curated RNA-seq-based benchmarks

Julián Candia et al. PLoS One.

Abstract

Pathway enrichment analysis is a ubiquitous computational biology method to interpret a list of genes (typically derived from the association of large-scale omics data with phenotypes of interest) in terms of higher-level, predefined gene sets that share biological function, chromosomal location, or other common features. Among many tools developed so far, Gene Set Enrichment Analysis (GSEA) stands out as one of the pioneering and most widely used methods. Although originally developed for microarray data, GSEA is nowadays extensively utilized for RNA-seq data analysis. Here, we quantitatively assessed the performance of a variety of GSEA modalities and provide guidance in the practical use of GSEA in RNA-seq experiments. We leveraged harmonized RNA-seq datasets available from The Cancer Genome Atlas (TCGA) in combination with large, curated pathway collections from the Molecular Signatures Database to obtain cancer-type-specific target pathway lists across multiple cancer types. We carried out a detailed analysis of GSEA performance using both gene-set and phenotype permutations combined with four different choices for the Kolmogorov-Smirnov enrichment statistic. Based on our benchmarks, we conclude that the classic/unweighted gene-set permutation approach offered comparable or better sensitivity-vs-specificity tradeoffs across cancer types compared with other, more complex and computationally intensive permutation methods. Finally, we analyzed other large cohorts for thyroid cancer and hepatocellular carcinoma. We utilized a new consensus metric, the Enrichment Evidence Score (EES), which showed a remarkable agreement between pathways identified in TCGA and those from other sources, despite differences in cancer etiology. This finding suggests an EES-based strategy to identify a core set of pathways that may be complemented by an expanded set of pathways for downstream exploratory analysis. 
This work fills the existing gap in current guidelines and benchmarks for the use of GSEA with RNA-seq data and provides a framework to enable detailed benchmarking of other RNA-seq-based pathway analysis tools.
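
The weighted Kolmogorov-Smirnov enrichment statistic and the gene-set permutation scheme mentioned in the abstract can be illustrated with a minimal sketch. It follows the commonly described running-sum formulation of GSEA, in which a weight parameter p = 0 is assumed to correspond to the classic/unweighted statistic. It is not the authors' pipeline or the GSEA software, and it omits steps such as score normalization and multiple-testing correction. The function names, the toy ranking, and the empirical p-value with a +1 correction are illustrative assumptions.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, p=0.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers sorted by decreasing ranking metric.
    scores: ranking metric values, in the same order as ranked_genes.
    p: weight exponent; p = 0 gives the unweighted (classic) statistic.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    # Each hit contributes |score|^p; misses contribute a constant step down.
    weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    norm = sum(weights)
    if norm == 0:
        raise ValueError("all hit weights are zero; cannot normalize")

    running, best = 0.0, 0.0
    for w, hit in zip(weights, in_set):
        running += w / norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


def gene_set_permutation_pvalue(ranked_genes, scores, gene_set,
                                p=0.0, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size (a sketch)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set, p)
    size = sum(g in gene_set for g in ranked_genes)

    null = [
        enrichment_score(ranked_genes, scores,
                         set(rng.sample(ranked_genes, size)), p)
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed one.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    extreme = sum(abs(es) >= abs(observed) for es in same_sign)
    return observed, (extreme + 1) / (len(same_sign) + 1)


if __name__ == "__main__":
    # Hypothetical toy ranking: 20 genes with linearly decreasing scores.
    genes = [f"g{i}" for i in range(20)]
    metric = [2.0 - 0.2 * i for i in range(20)]
    for weight in (0.0, 1.0, 1.5, 2.0):
        es, pval = gene_set_permutation_pvalue(
            genes, metric, {"g0", "g1", "g4"}, p=weight, n_perm=500)
        print(f"p={weight}: ES={es:.3f}, empirical p-value={pval:.3f}")
```

Phenotype permutation, the other scheme assessed in the paper, would instead shuffle sample labels and recompute the ranking metric for every permutation, which is why it is more computationally intensive than the gene-set shuffling sketched here.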

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Cancer-type-specific pathways.
Procedure to identify lists of preselected, target, and positive-control pathways using TCGA-BRCA (breast cancer) as example. See S2 Table for full details.
Fig 2. Significant TCGA-BRCA positive control pathways across different weight parameter choices.
(a) Gene-set permutation with p-value < 0.05. (b) Gene-set permutation with p-value < 0.01. (c) Phenotype permutation with p-value < 0.05. (d) Phenotype permutation with p-value < 0.01. GSEA enrichment statistics: classic (“cl”), weight parameter p = 1 (“p1”), weight parameter p = 1.5 (“p1.5”), and weight parameter p = 2 (“p2”).
Fig 3. Significant TCGA-BRCA positive control pathways using gene-set (“gs”) and phenotype (“ph”) permutation approaches for different enrichment statistics.
The significance criterion was p-value < 0.05. (a) Classic (unweighted). (b) Weight parameter p = 1. (c) Weight parameter p = 1.5. (d) Weight parameter p = 2.
Fig 4. ROC curves for different GSEA and ORA approaches using 72 TCGA-BRCA positive control pathways and 1000 randomized negative controls.
(a) Gene-set permutation GSEA. (b) Phenotype permutation GSEA. (c) Signed ORA. (d) Unsigned ORA. GSEA approaches used different enrichment statistics, as indicated. ORA approaches used Bonferroni and Benjamini-Hochberg (B-H) adjusted q-values as different inclusion criteria to select differentially expressed genes, as indicated.
Fig 5. AUC across TCGA projects for different GSEA and ORA approaches using cancer-type-specific positive control pathways and 1000 randomized negative controls.
(a) Gene-set permutation GSEA. (b) Phenotype permutation GSEA. (c) Signed ORA. (d) Unsigned ORA. GSEA approaches used different enrichment statistics, as indicated. ORA approaches used Bonferroni and Benjamini-Hochberg (B-H) adjusted q-values as different inclusion criteria to select differentially expressed genes, as indicated.
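
The ROC and AUC summaries in Figs 4 and 5 rest on ranking positive-control pathways against randomized negative controls by their significance. A minimal sketch of that calculation is given here, assuming each pathway is summarized by a single p-value and that pathways with smaller p-values are called first. The function names and the toy p-values are hypothetical and are not taken from the paper.

```python
def roc_points(pos_pvalues, neg_pvalues):
    """ROC curve as (false positive rate, true positive rate) pairs.

    A pathway is called significant when its p-value is <= the threshold.
    pos_pvalues: p-values of positive-control pathways.
    neg_pvalues: p-values of negative-control (e.g. randomized) pathways.
    """
    thresholds = sorted(set(pos_pvalues) | set(neg_pvalues))
    points = [(0.0, 0.0)]  # threshold below every p-value: nothing is called
    for t in thresholds:
        tpr = sum(pv <= t for pv in pos_pvalues) / len(pos_pvalues)
        fpr = sum(pv <= t for pv in neg_pvalues) / len(neg_pvalues)
        points.append((fpr, tpr))
    return points  # the last point is (1.0, 1.0)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical p-values for three positive and four negative controls.
    positives = [0.001, 0.01, 0.2]
    negatives = [0.05, 0.5, 0.9, 0.3]
    curve = roc_points(positives, negatives)
    print(curve)
    print(f"AUC = {auc(curve):.4f}")
```

An AUC near 1 means positive controls are almost always ranked ahead of negative controls, while an AUC near 0.5 means the method does not separate them better than chance.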
Fig 6. Pathway-level enrichment evidence scores for thyroid cancer and hepatocellular carcinoma cohorts.
(a) Comparison between significant target pathways in REBC-THYR vs TCGA-THCA. (b) Contingency table of grouped EES intervals in REBC-THYR vs TCGA-THCA (Fisher’s exact test p-value = 2.5 × 10^-6). (c) Comparison between significant target pathways in MO-HCC vs TCGA-LIHC. (d) Contingency table of grouped EES intervals in MO-HCC vs TCGA-LIHC (Fisher’s exact test p-value = 4.4 × 10^-15). Distance to the diagonal is represented with increasingly darker shades of blue.
Fig 7. Gene-level enrichment evidence scores for thyroid cancer cohorts.
(a) Comparison between EES leading-edge genes in REBC-THYR vs TCGA-THCA for genes in LUI_THYROID_CANCER_CLUSTER_1, previously identified as a high-consensus tumor-enriched pathway in thyroid cancer. (b) Contingency table of grouped EES intervals for the same case as in panel (a) (Fisher’s exact test p-value = 4.6 × 10^-8). (c) Network representation showing seven core thyroid cancer pathways and high-consensus leading-edge genes (|EES| ≥ 3 in at least one of the cohorts). Only genes connected to two or more pathways are shown.
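
The contingency-table comparisons in Figs 6 and 7 are summarized with Fisher's exact test. As a rough illustration of how such a p-value arises, the sketch below implements the two-sided test for a 2×2 table using the hypergeometric distribution. The 2×2 shape is a simplifying assumption (the grouped EES intervals in the paper may form larger tables), and the function name and example counts are hypothetical rather than values from the study.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution. The p-value sums the probabilities of every table that is
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * tolerance
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: rows could be "high vs low EES" in one cohort,
    # columns the same grouping in a second cohort.
    print(fisher_exact_2x2(12, 3, 2, 15))
```

A small p-value indicates that the grouping in one cohort is associated with the grouping in the other, which is the kind of agreement the captions report between TCGA and the independent cohorts.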
