Stat Methods Med Res. 2016 Feb;25(1):472-87.
doi: 10.1177/0962280212460441. Epub 2012 Oct 14.

The limitations of simple gene set enrichment analysis assuming gene independence

Pablo Tamayo et al. Stat Methods Med Res. 2016 Feb.

Abstract

Since its first publication in 2003, the Gene Set Enrichment Analysis method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently, Irizarry et al. (2009) proposed a simplified approach as a serious contender: it uses a one-sample t-test score to assess enrichment and ignores gene-gene correlations. Their argument criticizes Gene Set Enrichment Analysis's nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with Gene Set Enrichment Analysis on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, because of the significant variance inflation they produce in the enrichment scores, and that they should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods.

Keywords: Gene set enrichment analysis; gene expression.
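
The variance inflation mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes an equicorrelated Gaussian model for the per-gene scores, and the function names (simple_z, variance_inflation, simulate_null_variance) are hypothetical. Under that assumption, a score of the form sqrt(n) times the mean of n unit-variance gene scores has variance 1 + (n - 1) * rho, where rho is the average pairwise correlation, rather than the variance of 1 implied by an N(0, 1) null.

```python
import numpy as np


def simple_z(gene_scores):
    """Simplified enrichment score: sqrt(n) times the mean per-gene score."""
    scores = np.asarray(gene_scores, dtype=float)
    return np.sqrt(scores.size) * scores.mean()


def variance_inflation(n_genes, mean_corr):
    """Variance of simple_z for unit-variance gene scores whose
    average pairwise correlation is mean_corr (equals 1 if independent)."""
    return 1.0 + (n_genes - 1) * mean_corr


def simulate_null_variance(n_genes=50, mean_corr=0.05, n_sims=20000, seed=0):
    """Empirical variance of simple_z under an assumed equicorrelated null.

    Each gene score is sqrt(rho) * shared + sqrt(1 - rho) * own noise,
    which gives unit variance and pairwise correlation rho.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_sims, 1))
    own = rng.standard_normal((n_sims, n_genes))
    scores = np.sqrt(mean_corr) * shared + np.sqrt(1.0 - mean_corr) * own
    z = np.sqrt(n_genes) * scores.mean(axis=1)
    return z.var()


if __name__ == "__main__":
    print("theoretical variance:", variance_inflation(50, 0.05))
    print("simulated variance:  ", simulate_null_variance())
```

Even a small average correlation matters here: with 50 genes and rho = 0.05 the null variance is roughly 3.45 instead of 1, so p-values computed against N(0, 1) would be too small.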

Figures

Figure 1
Histograms of z-scores for 1,000 permutations of the samples (grey) and gene identifiers (black), and the SEA null distribution N(0, 1) for the A) P53 and B) Pancreas datasets. The legend also shows the mean and variance of the distributions.
Figure 2
Histograms of gene correlations and estimated variance inflation for the P53 (A and B) and Pancreas (C and D) datasets.
Figure 3
Histograms of p-values and q-values obtained by running SEA and GSEA on 1,000 randomly permuted phenotypes in the P53 dataset (A) and the Pancreas dataset (B). The y-axis shows the percentage of gene set results.
Figure 4
Percentage of gene sets with FDR less than 0.05 and 0.25 using SEA and GSEA in 1,000 permutations of the phenotype labels for each dataset in the benchmark set.
Figure 5
Histograms of p-values and q-values obtained by running SEA and GSEA on 1,000 randomly permuted phenotypes and randomized gene identifiers for the P53 dataset (A) and the Pancreas dataset (B). In contrast with Fig. 3, here the gene identifiers have also been randomized (gene-gene correlations not preserved). The y-axis shows the percentage of gene set results.
Figure 6
GSEA individual gene set enrichment plots: examples of top scoring gene sets that display complex behavior.
