Sci Rep. 2014 Feb 13:4:4092.
doi: 10.1038/srep04092.

Importance of collection in gene set enrichment analysis of drug response in cancer cell lines


Alain R Bateman et al. Sci Rep. 2014.

Abstract

Gene set enrichment analysis (GSEA) associates gene sets with phenotypes; its use is predicated on the choice of a pre-defined collection of sets. The de facto standard implementation of GSEA provides seven collections, yet there are no guidelines for choosing among them, and the impact of that choice, if any, is unknown. Here we compare each of the standard gene set collections in the context of a large dataset of drug response in human cancer cell lines. We also define and test a new collection based on gene co-expression in cancer cell lines, so that the standard collections can be compared with an externally derived, cell line-based collection. The results show that GSEA findings vary significantly depending on the collection chosen for analysis. Collections should therefore be carefully selected and reported in studies that leverage GSEA.


Figures

Figure 1
(A) Number and identity of gene sets identified as highly enriched (absolute normalized enrichment score > 2.0, maximum FDR < 25% across all drugs). (B) Heatmap of gene collection overlap score (g-index).
Figure 2
(A) Density plot of the distribution of normalized enrichment scores for all drugs, shown for each collection individually. (B) Heatmap of the number of highly enriched gene sets (absolute normalized enrichment score > 2.0, FDR < 25%) for each drug, in each collection. Gene set collections are listed along the bottom of the figure and drugs along the right. Darker hues of blue indicate a greater number of enriched gene sets for a particular drug.
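The per-drug, per-collection counts behind a heatmap like panel (B) can be sketched as follows. This is only an illustration of the stated thresholds (absolute normalized enrichment score > 2.0, FDR < 25%); the row layout, the drug and gene set names, and the function `count_highly_enriched` are hypothetical and not taken from the study.

```python
from collections import Counter

# Hypothetical GSEA result rows: (collection, drug, gene_set, NES, FDR).
# The values are made up for illustration only.
results = [
    ("C2", "drugA", "SET_1", 2.4, 0.05),
    ("C2", "drugA", "SET_2", -2.1, 0.20),
    ("C2", "drugB", "SET_1", 1.2, 0.01),   # fails the |NES| cutoff
    ("C4", "drugA", "SET_9", 2.6, 0.30),   # fails the FDR cutoff
    ("C4", "drugB", "SET_9", -2.2, 0.10),
]


def count_highly_enriched(rows, nes_cutoff=2.0, fdr_cutoff=0.25):
    """Count gene sets with |NES| > nes_cutoff and FDR < fdr_cutoff
    for every (drug, collection) pair."""
    counts = Counter()
    for collection, drug, _gene_set, nes, fdr in rows:
        if abs(nes) > nes_cutoff and fdr < fdr_cutoff:
            counts[(drug, collection)] += 1
    return counts


counts = count_highly_enriched(results)
for (drug, collection), k in sorted(counts.items()):
    print(drug, collection, k)
```

Each entry of `counts` corresponds to one cell of the heatmap; pairs that never pass both cutoffs are simply absent and read as zero.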
Figure 3
(A) Fractional contribution of each collection to the set of top-scoring gene sets, with n gene sets per drug. n is plotted along the abscissa; the ordinate shows the fraction of top-scoring gene sets contributed by each collection. As n increases, a larger number of gene sets per drug is assumed to be relevant or significant. Collection C2 is the highest contributor by a large margin, followed by C4; all other collections contribute to a negligible degree. The fractional contribution of C4 peaks before 10 top gene sets per drug, coinciding with C2's low. C4's contribution then trends slightly downward, while C2's trends upward to a lesser extent. (B) Fractional contribution of all Broad Institute collections plus our data-driven gene set collection, referred to as HGSK.
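A minimal sketch of how such a fractional contribution could be computed is given below. It assumes that "top scoring" means the n gene sets with the largest absolute normalized enrichment score per drug, pooled over all collections; that ranking criterion, the function name `fractional_contribution`, and the toy scores are assumptions made for illustration, not details confirmed by the caption.

```python
from collections import Counter


def fractional_contribution(results, n):
    """results: iterable of (collection, drug, gene_set, nes).
    For each drug keep the n gene sets with the largest |NES| (an assumed
    ranking criterion), then return the share of kept sets per collection."""
    by_drug = {}
    for collection, drug, _gene_set, nes in results:
        by_drug.setdefault(drug, []).append((abs(nes), collection))

    tally = Counter()
    for scored in by_drug.values():
        top = sorted(scored, key=lambda item: item[0], reverse=True)[:n]
        tally.update(collection for _score, collection in top)

    total = sum(tally.values())
    if total == 0:
        return {}
    return {collection: k / total for collection, k in tally.items()}


# Made-up scores for two hypothetical drugs.
toy_results = [
    ("C2", "drugA", "S1", 3.0), ("C4", "drugA", "S2", -2.5),
    ("C2", "drugA", "S3", 2.2), ("C5", "drugA", "S4", 1.0),
    ("C4", "drugB", "S5", 2.9), ("C2", "drugB", "S6", -2.8),
    ("C2", "drugB", "S7", 1.5),
]

frac_n1 = fractional_contribution(toy_results, 1)
frac_n3 = fractional_contribution(toy_results, 3)
print(frac_n1, frac_n3)
```

Evaluating the function over a range of n gives one curve per collection, which is the kind of plot panel (A) describes.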
Figure 4. The HGSK set collection is created from a gene-gene distance measure based on the reciprocal of a gene-gene correlation matrix computed from the expression of tumour cell lines in the GSK data set.
Genes are clustered using traditional hierarchical clustering based on this distance measure. Gene sets are then generated depth-first and recursively, iterating over the sub-trees of each cluster. Sets containing fewer than 15 genes or more than 500 genes are discarded.
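The tree-walking and size-filtering step can be sketched as below. The sketch assumes the hierarchical clustering has already been run and that its result is available as a binary tree of nested tuples with gene names at the leaves; that representation and the names `leaves` and `subtree_gene_sets` are hypothetical. The distance computation and the clustering itself are not shown, and the toy call uses small size bounds so the filter is visible on six genes.

```python
def leaves(tree):
    """Return the gene names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def subtree_gene_sets(tree, min_size=15, max_size=500):
    """Depth-first recursive walk: every internal node is a candidate gene
    set made of the genes below it. Sets with fewer than min_size or more
    than max_size genes are discarded."""
    if isinstance(tree, str):
        return []
    left, right = tree
    kept = []
    genes = leaves(tree)  # recomputed per node; fine for a sketch
    if min_size <= len(genes) <= max_size:
        kept.append(frozenset(genes))
    kept.extend(subtree_gene_sets(left, min_size, max_size))
    kept.extend(subtree_gene_sets(right, min_size, max_size))
    return kept


# Toy dendrogram over six hypothetical genes.
toy_tree = ((("G1", "G2"), "G3"), ("G4", ("G5", "G6")))
toy_sets = subtree_gene_sets(toy_tree, min_size=3, max_size=5)
print([sorted(s) for s in toy_sets])
```

With the default bounds of 15 and 500 the same function applies the size limits stated in the caption.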
Figure 5. Overall analysis design used in our comparative study.
First, we calculated the overlap between each pair of gene set collections. Second, we used a large pharmacogenomic dataset (CGP) to rank all genes by their association with response to each of the 138 drugs. Third, we used these rankings together with the gene set collections to run multiple GSEA runs. Fourth, we aggregated the results to compare the most enriched gene sets across collections. The results were then interpreted by taking into account the overlap between collections.
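To make the third step concrete, the core of an enrichment calculation on a ranked gene list can be sketched as a running-sum statistic. The version below is a simplified, unweighted variant (every hit counts equally) and omits the permutation-based normalization and FDR estimation that a full GSEA run performs; it is not the implementation used in the study, and the function name and toy gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.
    Walk down the ranked list (assumed to contain unique genes), stepping up
    by 1/n_hit at each gene in the set and down by 1/n_miss otherwise, and
    return the signed maximum deviation from zero."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in members:
            running += 1.0 / n_hit
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Genes ranked from most to least associated with response to one drug.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking scores close to +1 and one concentrated at the bottom close to -1; repeating the calculation for every drug ranking and every gene set in a collection yields the table of scores that the fourth step aggregates.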

