PLoS One. 2010 Sep 17;5(9):e12693.
doi: 10.1371/journal.pone.0012693.

Self-contained gene-set analysis of expression data: an evaluation of existing and novel methods

Brooke L Fridley et al. PLoS One. 2010.

Abstract

Gene set methods aim to assess the overall evidence of association of a set of genes with a phenotype, such as disease or a quantitative trait. Multiple approaches for gene set analysis of expression data have been proposed. They can be divided into two types: competitive and self-contained. Self-contained methods can be used for genome-wide, candidate gene, or pathway studies, and have been reported to be more powerful than competitive methods. We therefore investigated ten self-contained methods that can be used for continuous, discrete and time-to-event phenotypes. To assess the power and type I error rate of the previously proposed and novel approaches, we completed an extensive simulation study in which the scenarios varied by the number of genes in a gene set, the number of genes associated with the phenotype, the effect sizes, the correlation between expression of genes within a gene set, and the sample size. In addition to the simulated data, the methods were applied to a pharmacogenomic study of the drug gemcitabine. Simulation results showed that, overall, Fisher's method and the global model with random effects had the highest power across a wide range of scenarios, while the analyses based on the first principal component and on the Kolmogorov-Smirnov test tended to have the lowest power. The methods investigated here are likely to play an important role in identifying pathways that contribute to complex traits.
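As an illustration of the kind of self-contained test discussed in the abstract, the sketch below combines per-gene p-values with Fisher's method. It is a minimal example rather than the authors' implementation: the function name `fisher_combined` is hypothetical, and the closed-form p-value assumes the per-gene tests are independent. Expression values within a gene set are usually correlated, so in practice the null distribution of the statistic would more plausibly be obtained by permuting phenotype labels.

```python
import math


def fisher_combined(p_values):
    """Combine per-gene p-values with Fisher's method.

    Returns (statistic, p_value). The p-value is only valid if the
    per-gene tests are independent (an assumption of this sketch).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    # Fisher's statistic: X = -2 * sum(ln p_i), chi-square with 2k df under the null.
    statistic = -2.0 * sum(math.log(p) for p in p_values)

    # For even degrees of freedom 2k the chi-square survival function is
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical per-gene p-values for a four-gene set.
stat, p_set = fisher_combined([0.04, 0.20, 0.01, 0.50])
print(f"Fisher statistic = {stat:.3f}, gene-set p-value = {p_set:.4f}")
```

The test is self-contained in the sense used above: it uses only the genes in the set and asks whether any of them is associated with the phenotype, without comparing against genes outside the set.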

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Pairwise scatterplot of power for the various methods for scenarios with standard deviation (σ) of 6.0.
Figure 2. Plots of power for all methods.
Power is plotted as a function of (A) sample size, (B) the correlation between expression values within the gene set (ρ), (C) the proportion of probes associated with the phenotype, and (D) the calculated R², the proportion of variation in the quantitative phenotype explained by the gene expression values in the pathway. The average power values are based on all simulated non-null scenarios. Plot (B) shows only the fixed-correlation scenarios (ρ = 0, 0.1, 0.3) and excludes scenarios whose between-probe correlation structure was defined by the gemcitabine pathway. Plots (B), (C), and (D) are based on a sample size of 100; similar plots for sample sizes of 20 and 500 are shown in Figure S1. For plots (C) and (D), a kernel smoother was used to fit a curve to the data. Scenarios in which all expression probes were associated with the trait were excluded from plot (C), because all the methods had very high power in this situation.
Figure 3. Power of Fisher's Method (FM) as a function of sample size, correlation of expression values between probes (ρ), and R² (proportion of variation in the quantitative phenotype explained by the gene expression values in the gene set).
