Importance of collection in gene set enrichment analysis of drug response in cancer cell lines

Alain R Bateman¹, Nehme El-Hachem¹, Andrew H Beck², Hugo J W L Aerts³, Benjamin Haibe-Kains⁴

Affiliations

¹ Bioinformatics and Computational Genomics Laboratory, Institut de Recherches Cliniques de Montréal, University of Montreal, Montreal, Quebec, Canada.
² Department of Pathology, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, USA.
³ [1] Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; [2] Department of Radiation Oncology & Radiology, Dana-Farber Cancer Institute, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; [3] Department of Radiation Oncology, Maastricht University, Maastricht, The Netherlands.
⁴ [1] Bioinformatics and Computational Genomics Laboratory, Institut de Recherches Cliniques de Montréal, University of Montreal, Montreal, Quebec, Canada; [2] Ontario Cancer Institute, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada.

PMID: 24522610
PMCID: PMC3923229
DOI: 10.1038/srep04092

Alain R Bateman et al. Sci Rep. 2014 Feb 13:4:4092. doi: 10.1038/srep04092.

Abstract

Gene set enrichment analysis (GSEA) associates gene sets with phenotypes; its use is predicated on the choice of a pre-defined collection of sets. The de facto standard implementation of GSEA provides seven collections, yet there are no guidelines for choosing among them, and the impact of that choice, if any, is unknown. Here we compare each of the standard gene set collections in the context of a large dataset of drug response in human cancer cell lines. We also define and test a new collection based on gene co-expression in cancer cell lines, so that the standard collections can be compared with an externally derived, cell line-based collection. The results show that GSEA findings vary significantly depending on the collection chosen for analysis. Collections should therefore be carefully selected and reported in studies that leverage GSEA.

Figures

**Figure 1**
(A) Number and identity of gene sets identified as highly enriched (absolute normalized enrichment score > 2.0, maximum FDR < 25% across all drugs). (B) Heatmap of gene collection overlap score (g-index).

**Figure 2**
(A) Density plot of the distribution of normalized enrichment scores for all drugs, shown for each collection individually. (B) Heatmap of the number of highly enriched gene sets (absolute normalized enrichment score > 2.0, FDR < 25%) for each drug, in each collection. Gene set collections are listed along the bottom of the figure and drugs along the right. Darker hues of blue indicate a greater number of enriched gene sets for a particular drug.
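
The threshold quoted in Figures 1 and 2 (absolute normalized enrichment score > 2.0, FDR < 25%) amounts to a filter followed by a tally per drug and collection. The following is a minimal sketch under stated assumptions: the record layout and field names (`drug`, `collection`, `gene_set`, `nes`, `fdr`) are hypothetical, and the values are made up for illustration rather than taken from the study.

```python
from collections import Counter

# Hypothetical GSEA result records; field names and values are illustrative only.
results = [
    {"drug": "drugA", "collection": "C2", "gene_set": "SET_1", "nes": 2.4, "fdr": 0.01},
    {"drug": "drugA", "collection": "C2", "gene_set": "SET_2", "nes": -2.1, "fdr": 0.20},
    {"drug": "drugA", "collection": "C4", "gene_set": "SET_3", "nes": 1.7, "fdr": 0.05},
    {"drug": "drugB", "collection": "C4", "gene_set": "SET_3", "nes": 2.6, "fdr": 0.30},
    {"drug": "drugB", "collection": "C2", "gene_set": "SET_1", "nes": -2.3, "fdr": 0.10},
]


def is_highly_enriched(record, nes_cutoff=2.0, fdr_cutoff=0.25):
    """Apply the caption's threshold: |NES| > 2.0 and FDR < 25%."""
    return abs(record["nes"]) > nes_cutoff and record["fdr"] < fdr_cutoff


def count_enriched(records):
    """Count highly enriched gene sets for each (drug, collection) pair."""
    counts = Counter()
    for record in records:
        if is_highly_enriched(record):
            counts[(record["drug"], record["collection"])] += 1
    return counts


counts = count_enriched(results)
for (drug, collection), k in sorted(counts.items()):
    print(drug, collection, k)
```

A table of such counts, with drugs as rows and collections as columns, is the kind of input a heatmap like Figure 2B would be drawn from.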

**Figure 3**
(A) Fractional contribution of each collection to the set of top-scoring gene sets, taking n gene sets per drug. The abscissa shows n; the ordinate shows the fraction of the top-scoring gene sets contributed by each collection. As n increases, a higher number of gene sets per drug are assumed to be relevant or significant. Collection C2 is the highest contributor by a large margin, followed by C4; all other collections contribute to a negligible degree. The fractional contribution of C4 peaks before 10 top gene sets per drug, coinciding with C2's low. C4's contribution trends slightly downward afterwards, while C2's trends upward to a lesser extent. (B) Fractional contribution of all of the Broad's collections plus our data-driven gene set collection, referred to as HGSK.
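
The fractional contribution plotted in Figure 3 can be sketched as follows. This is an illustration, not the authors' code: it assumes that "top scoring" means ranking gene sets within each drug by absolute normalized enrichment score, which the caption does not state, and the tuple layout and toy values are hypothetical.

```python
from collections import Counter


def fractional_contribution(results, n):
    """Fraction of the pooled top-n gene sets per drug that each collection supplies.

    results: iterable of (drug, collection, gene_set, nes) tuples (hypothetical layout).
    Assumption: gene sets are ranked within each drug by |NES|.
    """
    by_drug = {}
    for drug, collection, gene_set, nes in results:
        by_drug.setdefault(drug, []).append((abs(nes), collection, gene_set))

    tally = Counter()
    for rows in by_drug.values():
        top = sorted(rows, key=lambda row: row[0], reverse=True)[:n]
        tally.update(collection for _, collection, _ in top)

    total = sum(tally.values())
    return {collection: k / total for collection, k in tally.items()}


# Toy values for illustration only.
toy_results = [
    ("drugA", "C2", "SET_1", 3.0),
    ("drugA", "C4", "SET_2", -2.5),
    ("drugA", "C2", "SET_3", 2.0),
    ("drugA", "C5", "SET_4", 1.0),
    ("drugB", "C2", "SET_1", -2.8),
    ("drugB", "C2", "SET_5", 2.2),
    ("drugB", "C4", "SET_2", 1.9),
]

for n in (1, 2, 3):
    print(n, fractional_contribution(toy_results, n))
```

Evaluating the function over a range of n gives one curve per collection, which is the shape of the plot the caption describes.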

**Figure 4. Creation of the HGSK gene set collection.**
A gene-gene distance measure is derived from the reciprocal of a gene-gene correlation matrix computed from the expression of tumour cell lines in the GSK data set. Genes are clustered by standard hierarchical clustering on this distance measure. Candidate sets are then generated by depth-first recursive traversal of the resulting tree, iterating over the sub-trees of each cluster. Sets containing fewer than 15 genes or more than 500 genes are discarded.
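
The procedure in Figure 4 can be sketched in plain Python. Several points are assumptions rather than details given in the caption: the "reciprocal of the correlation" is read here as 1 / |r|, the linkage is average linkage, and every internal node of the tree is treated as a candidate gene set. The function names and toy expression values are hypothetical, and the naive clustering loop is only suitable for small inputs; a real analysis would more likely rely on a library such as `scipy.cluster.hierarchy`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def reciprocal_distance(r, eps=1e-6):
    # Assumption: distance = 1 / |r|, floored at eps to avoid division by zero.
    return 1.0 / max(abs(r), eps)


def leaves(tree):
    """Gene names under a tree node; a leaf is a str, an internal node a pair."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def build_tree(expr):
    """Naive agglomerative clustering (average linkage, an assumption)."""
    genes = list(expr)
    dist = {(a, b): reciprocal_distance(pearson(expr[a], expr[b]))
            for a in genes for b in genes if a != b}

    def average_linkage(t1, t2):
        l1, l2 = leaves(t1), leaves(t2)
        return sum(dist[(a, b)] for a in l1 for b in l2) / (len(l1) * len(l2))

    clusters = list(genes)
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


def collect_sets(tree, min_size=15, max_size=500):
    """Depth-first traversal; keep sub-trees with min_size..max_size genes."""
    if isinstance(tree, str):
        return []
    members = leaves(tree)
    found = [sorted(members)] if min_size <= len(members) <= max_size else []
    for child in tree:
        found += collect_sets(child, min_size, max_size)
    return found


# Toy expression values (gene -> value per cell line), for illustration only.
toy_expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 11],
    "g3": [5, 3, 4, 1, 2],
    "g4": [10, 6, 9, 2, 3],
}
toy_tree = build_tree(toy_expr)
print(collect_sets(toy_tree, min_size=2, max_size=3))
```

The size bounds default to the 15 and 500 stated in the caption; the toy call lowers them only so that a four-gene example yields any sets at all.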

**Figure 5. Overall analysis design used in our comparative study.**
First, we calculated the overlap between each pair of gene set collections. Second, we used a large pharmacogenomic dataset (CGP) to rank all the genes with respect to their association with response to each of the 138 drugs. Third, we used these rankings together with the gene set collections to run multiple GSEA. Fourth, the results were aggregated to compare the most enriched gene sets across collections. The results were then interpreted by taking into account the overlap between collections.
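
The second and third steps feed a gene ranking into GSEA. As a rough illustration of what GSEA computes from such a ranking, the sketch below implements the standard weighted running-sum enrichment score. The statistic used to rank genes against drug response is not specified in this section, so the scores here are toy values, and the function and variable names are hypothetical. Normalization to NES and FDR estimation, which rely on permutations, are omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: genes ordered by decreasing ranking statistic.
    scores: gene -> ranking statistic (toy values in this sketch).
    p: weight exponent; p=0 gives the unweighted Kolmogorov-Smirnov-like form.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(scores[g]) ** p for g, h in zip(ranked_genes, hits) if h)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for gene, hit in zip(ranked_genes, hits):
        if hit:
            running += abs(scores[gene]) ** p / hit_total  # step up on a hit
        else:
            running -= 1.0 / n_miss                        # step down on a miss
        if abs(running) > abs(best):
            best = running                                 # max deviation from zero
    return best


# Toy ranking statistics, for illustration only.
toy_scores = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)

es_top = enrichment_score(toy_ranking, toy_scores, {"g1", "g2"})
es_bottom = enrichment_score(toy_ranking, toy_scores, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking scores close to +1 and one concentrated at the bottom close to -1; repeating the calculation for every drug and every gene set in every collection produces the results that the fourth step aggregates.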
