Gene Set Analysis: Challenges, Opportunities, and Future Research

Farhad Maleki¹, Katie Ovens¹, Daniel J Hogan¹, Anthony J Kusalik¹

PMID: 32695141
PMCID: PMC7339292
DOI: 10.3389/fgene.2020.00654

Review

Front Genet. 2020 Jun 30:11:654. doi: 10.3389/fgene.2020.00654. eCollection 2020.

Affiliation

¹ Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada.

Abstract

Gene set analysis methods are widely used to provide insight into high-throughput gene expression data. Many such methods are available; they rely on different assumptions and have different requirements, strengths, and weaknesses. In this paper, we classify gene set analysis methods based on their components, describe the underlying requirements and assumptions for each class, and provide directions for future research in developing and evaluating gene set analysis methods.

Keywords: gene expression; gene set analysis; gene set database; gene set enrichment; sensitivity; specificity.

Figures

**Figure 1**
Expression matrix for a pairwise comparison, where columns $A^{(c_{1})}, \dots, A^{(c_{\|C\|})}$ represent control samples and columns $A^{(t_{1})}, \dots, A^{(t_{\|T\|})}$ represent case samples. In this figure, $g_{i}^{(c_{j})}$ and $g_{i}^{(t_{j})}$ represent the expression measures for the $i$-th gene in the $c_{j}$-th control sample and the $t_{j}$-th case sample, respectively.

**Figure 2**
A schematic view of over-representation analysis (ORA) and univariate and multivariate FCS methods.

**Figure 3**
Visualization of gene sampling under the competitive null hypothesis. In this figure, $g_{i}^{(c_{j})}$ and $g_{i}^{(t_{j})}$ represent the expression measures for the $i$-th gene in the $c_{j}$-th control sample and the $t_{j}$-th case sample, respectively. A competitive null hypothesis states that there is no difference between the expression patterns of genes in a given gene set and those of the rest of the genes. For example, given a gene set G_i consisting of three genes G_i = {g₂, g₄, g₅}, depicted in green, the competitive null hypothesis states that there is no difference in the expression pattern of these genes compared to that of the rest of the genes, i.e., g₁, g₃, g₆, … , g_m (denoted as Ḡ_i and depicted in blue). In univariate methods, for each gene g_i, a gene score s_{g_i} is calculated using the expression measures for g_i across control and case samples. Then a gene set score f(G_i), which is representative of the difference in the expression pattern of genes in G_i in control samples vs. case samples, is calculated using the gene scores of genes in G_i. Often a gene sampling approach is used for the significance assessment of the gene set score f(G_i). In a multivariate setting, the intermediate step of summarizing expression values for each gene to a gene score s_{g_i} is omitted, and f(G_i) is directly calculated from the expression values of genes in G_i.
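
The univariate gene sampling procedure described in this caption can be sketched as follows. This is a minimal illustration, not the method of any particular tool: the choice of gene score (difference in mean expression), the choice of gene set score (mean absolute gene score), and the function names `gene_scores` and `gene_sampling_pvalue` are assumptions made for the example.

```python
import numpy as np

def gene_scores(control, case):
    """Per-gene score s_g: mean expression in case minus mean in control.

    `control` and `case` are (genes x samples) arrays. The difference of
    means is an illustrative choice; other gene scores could be used.
    """
    return case.mean(axis=1) - control.mean(axis=1)

def gene_sampling_pvalue(scores, gene_set, n_samples=1000, seed=0):
    """Assess f(G) against random gene sets of the same size.

    f(G) is taken here to be the mean absolute gene score (an assumption).
    The null distribution is built by drawing genes at random from all
    genes, which corresponds to a competitive null hypothesis.
    """
    rng = np.random.default_rng(seed)
    gene_set = np.asarray(gene_set)
    observed = np.abs(scores[gene_set]).mean()
    null = np.empty(n_samples)
    for b in range(n_samples):
        idx = rng.choice(scores.shape[0], size=gene_set.size, replace=False)
        null[b] = np.abs(scores[idx]).mean()
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_samples)
    return observed, p_value

# Toy data: 100 genes, 10 control and 10 case samples.
rng = np.random.default_rng(1)
control = rng.normal(size=(100, 10))
case = rng.normal(size=(100, 10))
G = [1, 3, 4]            # zero-based indices of g2, g4, g5
case[G] += 3.0           # shift the gene set upward in case samples
f_G, p = gene_sampling_pvalue(gene_scores(control, case), G)
print(f_G, p)
```

Because the null distribution is drawn from the gene scores of all genes, the resulting p-value depends on how the rest of the genes behave, which is what distinguishes the competitive setting from the self-contained one.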

**Figure 4**
Visualization of phenotype permutation under the self-contained null hypothesis. In this figure, $g_{i}^{(c_{j})}$ and $g_{i}^{(t_{j})}$ represent the expression measures for the $i$-th gene in the $c_{j}$-th control sample and the $t_{j}$-th case sample, respectively. The self-contained null hypothesis states that the expression pattern of genes within a gene set does not differ between case and control samples. For example, given a gene set G_i consisting of three genes G_i = {g₂, g₄, g₅}, the self-contained null hypothesis states that there is no difference in the expression pattern of these genes in control samples vs. case samples. It should be noted that the self-contained null hypothesis does not concern the rest of the genes, i.e., genes not in G_i, which are shown in white here. In univariate methods, for each gene g_i, a gene score s_{g_i} is calculated using the expression measures for g_i across control and case samples. A gene set score f(G_i), which is representative of the difference in the expression pattern of genes in G_i in control samples vs. case samples, is calculated using the gene scores of genes in G_i. Often a phenotype permutation approach is used for significance assessment of the gene set score f(G_i). In a multivariate setting, the intermediate step of summarizing expression values for each gene to a gene score s_{g_i} is omitted, and f(G_i) is directly calculated from the expression values of genes in G_i.
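
The phenotype permutation procedure in this caption can be sketched in the same spirit. Again, this is only an illustration under stated assumptions: the gene set score (mean absolute difference in mean expression between case and control) and the names `set_score` and `phenotype_permutation_pvalue` are hypothetical choices for the example, not a specific published method.

```python
import numpy as np

def set_score(expr, is_case, gene_set):
    """f(G): mean absolute case-vs-control difference over genes in G.

    `expr` is a (genes x samples) array and `is_case` a boolean vector
    marking case samples. Only rows in `gene_set` are used, so genes
    outside the set do not influence the score.
    """
    sub = expr[gene_set]
    diff = sub[:, is_case].mean(axis=1) - sub[:, ~is_case].mean(axis=1)
    return np.abs(diff).mean()

def phenotype_permutation_pvalue(expr, is_case, gene_set, n_perm=1000, seed=0):
    """Assess f(G) by shuffling the case/control labels of the samples."""
    rng = np.random.default_rng(seed)
    is_case = np.asarray(is_case, dtype=bool)
    observed = set_score(expr, is_case, gene_set)
    null = np.array([
        set_score(expr, rng.permutation(is_case), gene_set)
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

# Toy data: 6 genes, 10 control samples followed by 10 case samples.
rng = np.random.default_rng(1)
expr = rng.normal(size=(6, 20))
is_case = np.array([False] * 10 + [True] * 10)
G = [1, 3, 4]                      # zero-based indices of g2, g4, g5
expr[np.ix_(G, np.where(is_case)[0])] += 3.0
f_G, p = phenotype_permutation_pvalue(expr, is_case, G)
print(f_G, p)
```

Permuting sample labels rather than genes leaves the gene-gene correlation structure within each sample intact, and the rest of the genes never enter the calculation, which matches the self-contained hypothesis described above.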

**Figure 5**
Visualization of phenotype permutation under the self-contained hybrid null hypothesis. This type of hypothesis states that the relative expression pattern of genes within a gene set is not differentially associated with phenotypes. For example, given a gene set G_i consisting of three genes G_i = {g₂, g₄, g₅}, the hybrid null hypothesis states that there is no difference in the relative expression pattern of these genes between phenotypes. In this figure, s_{g_i} represents the gene score for the gene g_i. Unlike the sample permutation approach used under a self-contained null hypothesis, here the gene scores for genes in Ḡ_i, as well as those for genes in G_i, can contribute to the calculation of f(G_i). For example, the distribution of gene scores for genes in Ḡ_i can affect the enrichment score of G_i calculated by GSEA. The contributions of genes in G_i and genes in Ḡ_i are depicted with solid green lines and dashed blue lines, respectively. See Figure S1 for a visualization of gene sampling under the competitive hybrid null hypothesis.
