Challenges in identifying cancer genes by analysis of exome sequencing data

Matan Hofree^{1,2,3}, Hannah Carter^{2,3,4}, Jason F Kreisberg^{1,3}, Sourav Bandyopadhyay⁵, Paul S Mischel⁶, Stephen Friend⁷, Trey Ideker^{1,2,3,4}

Affiliations

¹ Cancer Cell Map Initiative (CCMI), 9500 Gilman Drive, La Jolla, California 92093, USA.
² Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093, USA.
³ Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093, USA.
⁴ Moores Cancer Center, University of California San Diego, 3855 Health Sciences Drive, La Jolla, California 92093, USA.
⁵ Diller Family Comprehensive Cancer Center, University of California San Francisco, 1600 Divisadero Street, San Francisco, California 94115, USA.
⁶ Ludwig Institute for Cancer Research, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093, USA.
⁷ Sage Bionetworks, Seattle, 110 Fairview Avenue North, Seattle, Washington 98109, USA.

PMID: 27417679
PMCID: PMC4947162
DOI: 10.1038/ncomms12096

Matan Hofree et al. Nat Commun. 2016 Jul 15:7:12096. doi: 10.1038/ncomms12096.

Abstract

Massively parallel sequencing has permitted an unprecedented examination of the cancer exome, leading to predictions that all genes important to cancer will soon be identified by genetic analysis of tumours. To examine this potential, here we evaluate the ability of state-of-the-art sequence analysis methods to specifically recover known cancer genes. While some cancer genes are identified by analysis of recurrence, spatial clustering or predicted impact of somatic mutations, many remain undetected due to lack of power to discriminate driver mutations from the background mutational load (13-60% recall of cancer genes impacted by somatic single-nucleotide variants, depending on the method). Cancer genes not detected by mutation recurrence also tend to be missed by all types of exome analysis. Nonetheless, these genes are implicated by other experiments such as functional genetic screens and expression profiling. These challenges are only partially addressed by increasing sample size and will likely hold even as greater numbers of tumours are analysed.

Figures

**Figure 1. Original experimental techniques used to identify currently known cancer genes.**
(a) Shown is the cumulative number of cancer genes known to be perturbed by somatic single-nucleotide variations, as recorded in the COSMIC CGC, according to the year of first cancer-related publication indexed in PubMed. Each bar is coloured by the experimental technique categories used by these first publications. In parentheses is the number of genes associated with each experimental category as of 2013. (b) Proportion of the different types of somatic alteration included in the CGC. In blue are the proportions for all somatically altered genes; in green are the same proportions for genes also known to have single-nucleotide alterations.

**Figure 2. Performance of methods.**
Heatmaps showing the (a) recall and (b) precision of each method (rows) tested against each positive cancer reference set (columns). Dashed box highlights the performance of MAIN-METHODS on the CGC-SNV reference set. To compute precision, we assume the proportion of cancer genes is 5% of all human genes; precision values for other proportions are shown in Supplementary Fig. 1 with qualitatively similar results. (c) Precision/recall plot detailing results from a and b for CGC-SNV cancer genes. (d) Summary of CGC-SNV genes curated for particular cancer tissues versus their cancer detection status based on genome analysis by four different methods and their union. (e) Count of CGC-SNV genes as a function of the number of cancer tissue types in which each gene has been detected thus far.
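
The prevalence assumption in this caption can be made concrete with a small sketch. It assumes precision is derived from a method's recall on a positive reference set and its false positive rate on a negative reference set, then rescaled to an assumed fraction of cancer genes among all genes. The function name, its parameters and the example numbers are hypothetical and are not taken from the paper.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence=0.05):
    """Precision implied by recall and false positive rate at a given prevalence.

    Assumption: 'prevalence' is the fraction of all genes that are true
    cancer genes (5% in the caption); this is Bayes' rule, not necessarily
    the exact procedure used by the authors.
    """
    true_pos = recall * prevalence                        # expected true-positive mass
    false_pos = false_positive_rate * (1.0 - prevalence)  # expected false-positive mass
    if true_pos + false_pos == 0.0:
        return float("nan")  # the method calls nothing, precision is undefined
    return true_pos / (true_pos + false_pos)


# Hypothetical method: 50% recall, 5% false positive rate.
for prev in (0.01, 0.05, 0.10):
    print(prev, round(precision_at_prevalence(0.5, 0.05, prev), 3))
```

Under this formulation a lower assumed prevalence lowers precision for the same recall and false positive rate, which is why the assumed proportion has to be stated alongside the precision values.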

**Figure 3. Experimental support for reference cancer gene lists.**
(a–c) Support for CGC cancer genes detected by any of the MAIN-METHODS for analysing tumour genomes (cancer detected) versus those cancer genes that were undetected by any of these (cancer undetected). Also shown is support for the AGO-NEG negative control set of non-cancer genes (likely non-cancer) and the remainder of genes in the genome-wide background (all other genes). Whisker plots indicate mean and the 95% confidence interval of the mean. Support is evaluated using: (a) RNA-seq tumour-normal differential expression in The Cancer Genome Atlas (TCGA). (b) Number of times a gene has been identified in independent cancer genetic screens in mice. (c) Number of Project Achilles cell lines with a measured impact (top/bottom 10%) on growth as a result of shRNA knockdown. An asterisk (*) indicates a significant difference in medians was found between the two sets. (d) The number of cancer publications by year comparing detected and undetected CGC cancer genes.

**Figure 4. Power to detect recurrently mutated genes as the number of tumour exomes increases.**
(a) Number of patient samples (y axis) necessary for detecting a cancer gene, as a function of the background somatic mutation rate of the tissue (x axis) and the fold increase in mutation rate of the cancer gene above this background (coloured lines). The total 10-year U.S. incidences of major cancer types are indicated (grey circles with horizontal bars), along with the number of patients currently sequenced as listed by the ICGC database v20 (dotted circles). (b) Mutated genes of a single breast adenocarcinoma patient, ranked by mutation frequency within tumours of this tissue type. (c) Same analysis showing the median behaviour for 881 The Cancer Genome Atlas (TCGA) patients with breast cancer. Mutated genes in each patient are ranked by mutation frequency; the median mutation frequency over all patients is plotted for each percentile.
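
The relationship in panel (a) between sample size, background mutation rate and fold increase can be illustrated with a simple one-sided binomial power calculation. This is a minimal sketch under stated assumptions, not the model behind the figure: each patient independently carries a mutation in the gene with probability `p0` under the background and `fold * p0` for a driver gene, and the significance threshold `alpha = 2.5e-6` (roughly 0.05 spread over 20,000 genes) is an illustrative choice. All function names are hypothetical.

```python
import math


def upper_tails(n, p):
    """Return t where t[k] = P(X >= k) for X ~ Binomial(n, p), k = 0..n+1.

    Requires 0 < p < 1. The pmf is evaluated in log space to avoid overflow.
    """
    log_p, log_q = math.log(p), math.log1p(-p)
    pmf = [
        math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                 + k * log_p + (n - k) * log_q)
        for k in range(n + 1)
    ]
    tails = [0.0] * (n + 2)
    for k in range(n, -1, -1):  # accumulate from the far tail inwards
        tails[k] = tails[k + 1] + pmf[k]
    return tails


def detection_power(n, p0, fold, alpha=2.5e-6):
    """Power to flag a gene mutated at fold * p0 when the background rate is p0."""
    p1 = min(fold * p0, 1.0 - 1e-12)
    null_tails = upper_tails(n, p0)
    # Smallest mutated-patient count whose background tail probability is <= alpha.
    k_crit = next(k for k in range(n + 2) if null_tails[k] <= alpha)
    return upper_tails(n, p1)[k_crit]


def samples_needed(p0, fold, alpha=2.5e-6, target=0.9, candidates=range(100, 2001, 100)):
    """First candidate cohort size reaching the target power, or None if none does."""
    for n in candidates:
        if detection_power(n, p0, fold, alpha) >= target:
            return n
    return None


# Hypothetical gene mutated in 1% of patients by chance alone.
for fold in (2.0, 5.0):
    print(fold, samples_needed(0.01, fold))
```

In this sketch a gene mutated only modestly above background needs a far larger cohort than one mutated at several times the background rate, which is the qualitative trend the panel describes.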
