Review
Front Genet. 2020 Jun 30:11:654.
doi: 10.3389/fgene.2020.00654. eCollection 2020.

Gene Set Analysis: Challenges, Opportunities, and Future Research

Farhad Maleki et al. Front Genet.

Abstract

Gene set analysis methods are widely used to provide insight into high-throughput gene expression data. There are many gene set analysis methods available. These methods rely on various assumptions and have different requirements, strengths and weaknesses. In this paper, we classify gene set analysis methods based on their components, describe the underlying requirements and assumptions for each class, and provide directions for future research in developing and evaluating gene set analysis methods.

Keywords: gene expression; gene set analysis; gene set database; gene set enrichment; sensitivity; specificity.

Figures

Figure 1
Expression matrix for a pairwise comparison where columns A(c1), …, A(c|C|) represent control samples and columns A(t1), …, A(t|T|) represent case samples. In this figure, gi(cj) and gi(tj) represent the expression measures for the ith gene in the cjth control sample and tjth case sample, respectively.
Figure 2
A schematic view of over-representation analysis (ORA) and univariate and multivariate FCS methods.
Figure 3
Visualization of gene sampling under the competitive null hypothesis. In this figure, gi(cj) and gi(tj) represent the expression measures for the ith gene in the cjth control sample and tjth case sample, respectively. A competitive null hypothesis states that there is no difference between the expression patterns of genes in a given gene set in comparison to that of the rest of the genes. For example, given a gene set Gi consisting of three genes Gi = {g2, g4, g5}, depicted in green, the competitive null hypothesis states that there is no difference in the expression pattern of these genes compared to that of the rest of genes, i.e., g1, g3, g6, … , gm—denoted as Ḡi and depicted in blue. In univariate methods, for each gene gi, a gene score sgi is calculated using the expression measures for gi across control and case samples. Then a gene set score f(Gi)—which is representative of the difference in the expression pattern of genes in Gi in control samples vs. case samples—is calculated using the gene scores of genes in Gi. Often a gene sampling approach is used for the significance assessment of the gene set score f(Gi). In a multivariate setting, the intermediate step of summarizing expression values for each gene to a gene score sgi is omitted, and f(Gi) is directly calculated from the expression values of genes in Gi.
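The univariate gene sampling procedure described in this caption can be sketched as follows. This is a minimal illustration, not a method from the paper: the gene score (difference of group means), the gene set score (mean absolute gene score), the function names, and the toy data are all assumptions chosen for brevity.

```python
import random
from statistics import mean


def gene_scores(control, case):
    """One score per gene (rows are genes): mean(case) - mean(control).
    Assumed scoring rule, for illustration only."""
    return [mean(t) - mean(c) for c, t in zip(control, case)]


def set_score(scores, gene_set):
    """Gene set score f(G_i): mean absolute gene score over the set (assumed)."""
    return mean(abs(scores[g]) for g in gene_set)


def gene_sampling_p_value(control, case, gene_set, n_samples=1000, seed=0):
    """Competitive test: compare f(G_i) with the scores of random gene sets
    of the same size drawn from all genes."""
    rng = random.Random(seed)
    scores = gene_scores(control, case)
    observed = set_score(scores, gene_set)
    all_genes = range(len(scores))
    hits = 0
    for _ in range(n_samples):
        random_set = rng.sample(all_genes, len(gene_set))
        if set_score(scores, random_set) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_samples + 1)


# Toy data (hypothetical): 6 genes, 3 control and 3 case samples.
# Genes g2, g4, g5 (indices 1, 3, 4) are shifted upward in case samples.
control = [[5.0, 5.1, 4.9], [3.0, 3.2, 2.8], [7.0, 7.1, 6.9],
           [2.0, 2.1, 1.9], [4.0, 4.2, 3.8], [6.0, 6.1, 5.9]]
case = [[5.0, 5.2, 4.8], [6.0, 6.1, 5.9], [7.1, 7.0, 6.9],
        [5.0, 5.2, 4.8], [7.0, 7.1, 6.9], [6.0, 5.9, 6.1]]
p = gene_sampling_p_value(control, case, gene_set=[1, 3, 4])
```

Because the sample labels stay fixed and only gene membership is resampled, the reference distribution reflects how unusual the set is relative to the other genes, which is the comparison the competitive null hypothesis makes.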
Figure 4
Visualization of phenotype permutation under the self-contained null hypothesis. In this figure, gi(cj) and gi(tj) represent the expression measures for the ith gene in the cjth control sample and tjth case sample, respectively. The self-contained null hypothesis states that the expression pattern of genes within a gene set does not differ between case and control samples. For example, given a gene set Gi consisting of three genes Gi = {g2, g4, g5}, the self-contained null hypothesis states that there is no difference in the expression pattern of these genes in control samples vs. case samples. It should be noted that the self-contained null hypothesis does not concern the rest of genes, i.e., genes not in Gi, which are shown in white here. In univariate methods, for each gene gi, a gene score sgi is calculated using the expression measures for gi across control and case samples. A gene set score f(Gi)—which is representative of the difference in the expression pattern of genes in Gi in control samples vs. case samples—is calculated using the gene scores of genes in Gi. Often a phenotype permutation approach is used for significance assessment of the gene set score f(Gi). In a multivariate setting, the intermediate step of summarizing expression values for each gene to a gene score sgi is omitted, and f(Gi) is directly calculated from the expression values of genes in Gi.
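A phenotype permutation test of the kind described in this caption might look like the sketch below. The scoring rules (difference of group means per gene, mean absolute gene score per set), the function names, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import random
from statistics import mean


def set_score(control, case, gene_set):
    """f(G_i): mean over genes in the set of |mean(case) - mean(control)|.
    Assumed scoring rule; only genes in the set are used."""
    return mean(abs(mean(case[g]) - mean(control[g])) for g in gene_set)


def phenotype_permutation_p_value(control, case, gene_set, n_perm=1000, seed=0):
    """Self-contained test: shuffle sample labels and recompute f(G_i)."""
    rng = random.Random(seed)
    n_control = len(control[0])
    n_total = n_control + len(case[0])
    observed = set_score(control, case, gene_set)
    # Pool the samples for genes in the set; genes outside it are never read.
    pooled = {g: control[g] + case[g] for g in gene_set}
    hits = 0
    for _ in range(n_perm):
        order = list(range(n_total))
        rng.shuffle(order)  # random reassignment of samples to phenotypes
        perm_control = {g: [pooled[g][j] for j in order[:n_control]] for g in gene_set}
        perm_case = {g: [pooled[g][j] for j in order[n_control:]] for g in gene_set}
        if set_score(perm_control, perm_case, gene_set) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 6 genes, 3 control and 3 case samples.
control = [[5.0, 5.1, 4.9], [3.0, 3.2, 2.8], [7.0, 7.1, 6.9],
           [2.0, 2.1, 1.9], [4.0, 4.2, 3.8], [6.0, 6.1, 5.9]]
case = [[5.0, 5.2, 4.8], [6.0, 6.1, 5.9], [7.1, 7.0, 6.9],
        [5.0, 5.2, 4.8], [7.0, 7.1, 6.9], [6.0, 5.9, 6.1]]
p = phenotype_permutation_p_value(control, case, gene_set=[1, 3, 4])
```

Shuffling whole samples rather than individual values keeps the gene-gene correlation within each sample intact. With only three samples per group there are few distinct label assignments, so the smallest attainable p-value in this toy setting is bounded well above zero.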
Figure 5
Visualization of phenotype permutation under the self-contained hybrid null hypothesis. This type of hypothesis states that the relative expression pattern of genes within a gene set is not differentially associated with phenotypes. For example, given a gene set Gi consisting of three genes Gi = {g2, g4, g5}, the hybrid null hypothesis states that there is no difference in the relative expression pattern of these genes between phenotypes. In this figure, sgi represents the gene score for the gene gi. Unlike the sample permutation approach used under a self-contained null hypothesis, not only do gene scores for genes in Gi contribute to the calculation of f(Gi) but also gene scores for genes in Ḡi can contribute to this calculation. For example, the distribution of gene scores for genes in Ḡi can affect the enrichment score of Gi calculated by GSEA. The contribution of genes in Gi and genes in Ḡi are depicted with solid green lines and dashed blue lines, respectively. See Figure S1 for a visualization of gene sampling under the competitive hybrid null hypothesis.
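To make concrete how gene scores outside Gi can affect the score of Gi, here is a simplified running-sum enrichment score in the style of GSEA. It is a sketch under stated assumptions rather than the reference GSEA implementation: the function name, the default weight of 1, and the example scores are hypothetical, and the code assumes at least one gene in the set has a nonzero score.

```python
def enrichment_score(scores, gene_set, weight=1.0):
    """Walk down the genes ranked by score. Step up at genes in the set
    (in proportion to |score| ** weight), step down by a constant at all
    other genes, and return the largest deviation from zero."""
    in_set = set(gene_set)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    hit_total = sum(abs(scores[g]) ** weight for g in in_set)
    miss_step = 1.0 / (len(scores) - len(in_set))
    running = 0.0
    best = 0.0
    for g in ranked:
        if g in in_set:
            running += abs(scores[g]) ** weight / hit_total
        else:
            running -= miss_step
        if abs(running) > abs(best):
            best = running
    return best


gene_set = [1, 3, 4]  # g2, g4, g5 with zero-based indices

# Same scores for the genes in the set; only the other genes differ.
scores_a = [0.1, 3.0, -0.2, 2.5, 2.0, 0.0]
scores_b = [5.0, 3.0, -0.2, 2.5, 2.0, 4.0]

es_a = enrichment_score(scores_a, gene_set)
es_b = enrichment_score(scores_b, gene_set)
```

In the two calls the scores of the genes in the set are identical, yet the enrichment score changes because two genes outside the set move to the top of the ranking. This is the dependence on Ḡi that distinguishes the hybrid hypothesis from a purely self-contained one.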
