2019 May 17;20(1):257.
doi: 10.1186/s12859-019-2850-1.

Improving the power of gene set enrichment analyses

Joanna Roder et al. BMC Bioinformatics.

Abstract

Background: Set enrichment methods are commonly used to analyze high-dimensional molecular data and gain biological insight into molecular or clinical phenotypes. One important category of analysis methods employs an enrichment score, which is created from ranked univariate correlations between phenotype and each molecular attribute. Estimates of the significance of the associations are determined via a null distribution generated from phenotype permutation. We investigate some statistical properties of this method and demonstrate how alternative assessments of enrichment can be used to increase the statistical power of such analyses to detect associations between phenotype and biological processes and pathways.
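As a rough illustration of the category of method described in the Background, the sketch below computes a running-sum enrichment score from ranked univariate phenotype-gene correlations and estimates its significance by phenotype permutation. It assumes an unweighted (Kolmogorov-Smirnov-like) running sum, Pearson correlation, and a two-sided p value; the abstract does not specify these choices, and the names `enrichment_score` and `permutation_p_value` are hypothetical.

```python
import numpy as np


def enrichment_score(expr, phenotype, in_set):
    """Signed maximum deviation of an unweighted running sum (assumed form).

    expr: (n_samples, n_genes) array; phenotype: (n_samples,) array;
    in_set: boolean mask of length n_genes marking gene set members.
    """
    # Univariate Pearson correlation between the phenotype and each gene.
    x = expr - expr.mean(axis=0)
    y = phenotype - phenotype.mean()
    corr = (x * y[:, None]).sum(axis=0) / np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())

    # Rank genes from most positively to most negatively correlated.
    hits = in_set[np.argsort(-corr)]
    n_hit = hits.sum()
    n_miss = hits.size - n_hit

    # Step up at set members and down at non-members; the sum ends at zero.
    running = np.cumsum(np.where(hits, 1.0 / n_hit, -1.0 / n_miss))
    return float(running[np.argmax(np.abs(running))])


def permutation_p_value(expr, phenotype, in_set, n_perm=200, seed=0):
    """Two-sided p value from a phenotype-permutation null distribution."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(expr, phenotype, in_set)
    null = np.array([
        enrichment_score(expr, rng.permutation(phenotype), in_set)
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the estimate strictly positive.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, float(p)
```

Permuting the phenotype labels, rather than the genes, preserves the correlation structure among genes in the null distribution, which is the approach the abstract refers to.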

Results: For this category of set enrichment analysis, the null distribution is largely independent of the number of samples with available molecular data. Hence, provided that the sample cohort is not too small, we show that increased statistical power to identify associations between biological processes and phenotype can be achieved by splitting the cohort into two halves and using the average of the enrichment scores evaluated for each half as an alternative test statistic. Further, we demonstrate that this principle can be extended by averaging over multiple random splits of the cohort into halves. This enables the calculation of an enrichment statistic and associated p value of arbitrary precision, independent of the exact random splits used.
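The split-and-average statistic described in the Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses an unweighted running-sum enrichment score with Pearson correlation, a continuous phenotype, and a two-sided permutation p value, and the function names (`enrichment_score`, `split_average_es`, `split_average_p_value`) are hypothetical.

```python
import numpy as np


def enrichment_score(expr, phenotype, in_set):
    """Signed maximum deviation of an unweighted running sum (assumed form)."""
    x = expr - expr.mean(axis=0)
    y = phenotype - phenotype.mean()
    corr = (x * y[:, None]).sum(axis=0) / np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    hits = in_set[np.argsort(-corr)]
    n_hit = hits.sum()
    n_miss = hits.size - n_hit
    running = np.cumsum(np.where(hits, 1.0 / n_hit, -1.0 / n_miss))
    return float(running[np.argmax(np.abs(running))])


def split_average_es(expr, phenotype, in_set, n_splits, rng):
    """Average of half-cohort enrichment scores over random splits."""
    n = len(phenotype)
    half = n // 2
    scores = []
    for _ in range(n_splits):
        # Randomly divide the cohort into two disjoint halves.
        idx = rng.permutation(n)
        a, b = idx[:half], idx[half:]
        es_a = enrichment_score(expr[a], phenotype[a], in_set)
        es_b = enrichment_score(expr[b], phenotype[b], in_set)
        scores.append(0.5 * (es_a + es_b))
    # With n_splits = 1 this is the single-split average; larger values
    # reduce the dependence on any particular split.
    return float(np.mean(scores))


def split_average_p_value(expr, phenotype, in_set, n_splits=5, n_perm=100, seed=0):
    """Two-sided permutation p value for the split-averaged statistic."""
    rng = np.random.default_rng(seed)
    observed = split_average_es(expr, phenotype, in_set, n_splits, rng)
    # Null distribution: permute the phenotype, then apply the same statistic.
    null = np.array([
        split_average_es(expr, rng.permutation(phenotype), in_set, n_splits, rng)
        for _ in range(n_perm)
    ])
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, float(p)
```

Because the null distribution is computed with the same splitting and averaging as the observed statistic, the comparison stays like-for-like. Increasing `n_splits` trades computation for a statistic that depends less on the particular random splits drawn.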

Conclusions: It is possible to increase the statistical power of gene set enrichment analyses that employ enrichment scores created from running sums of univariate phenotype-attribute correlations and phenotype-permutation generated null distributions. This increase can be achieved by using alternative test statistics that average enrichment scores calculated for splits of the dataset. Apart from the special case of a close balance between up- and down-regulated genes within a gene set, statistical power can be improved, or at least maintained, by this method down to small sample sizes, where accurate assessment of univariate phenotype-gene correlations becomes unfeasible.

Keywords: Enrichment analysis; Gene set enrichment analysis; Statistical power.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Null distribution for ES and ESavg for N = 20, 40, 60, 80, 100, and 200. (a) HALLMARKS_MYC_TARGETS_V1, (b) HALLMARKS_ALLOGRAFT_REJECTION. Distributions for ES are shown in blue and those for ESavg are shown in red.
Fig. 2. Sampling distribution for ES and ESavg for N = 20, 40, 60, 80, 100, and 200. (a) HALLMARKS_MYC_TARGETS_V1, (b) HALLMARKS_ALLOGRAFT_REJECTION.
Fig. 3. Power to detect association of phenotype with HALLMARKS_MYC_TARGETS_V1 (blue) and HALLMARKS_ALLOGRAFT_REJECTION (red) with α = 0.05. Power is shown as a function of N for ES (dotted line) and ESavg (solid line).
Fig. 4. Null distributions for ES and for <ESavg>. Null distributions for <ESavg> are shown for one split (ESavg = <ESavg>), two splits, and 25 splits. All distributions are generated for one subset of 200 samples drawn from the 294-patient cohort.
Fig. 5. Distribution of ESavg and <ESavg> (two splits and 25 splits) for 1000 random splitting averages. All distributions are for a single subset of 200 samples using the MYC_TARGETS_V1 gene set.
Fig. 6. p value distributions over dataset realizations for ES, ESavg, and <ESavg> for control gene sets. (a) Gene Set a, (b) Gene Set j.
