Expanding the computational toolbox for mining cancer genomes

Li Ding¹, Michael C Wendl², Joshua F McMichael³, Benjamin J Raphael⁴

Affiliations

¹ [1] The Genome Institute, Washington University in St. Louis, 4444 Forest Park Ave., St. Louis, Missouri 63108, USA. [2] Department of Medicine, Washington University in St. Louis, 660 S. Euclid Ave., St. Louis, Missouri 63110, USA. [3] Department of Genetics, Washington University in St. Louis, 660 S. Euclid Ave., St. Louis, Missouri 63110, USA. [4] Siteman Cancer Center, Washington University in St. Louis, 4921 Parkview Place, St. Louis, Missouri 63110, USA.
² [1] The Genome Institute, Washington University in St. Louis, 4444 Forest Park Ave., St. Louis, Missouri 63108, USA. [2] Department of Genetics, Washington University in St. Louis, 660 S. Euclid Ave., St. Louis, Missouri 63110, USA. [3] Department of Mathematics, Washington University in St. Louis, 1 Brookings Drive, St. Louis, Missouri 63130, USA.
³ The Genome Institute, Washington University in St. Louis, 4444 Forest Park Ave., St. Louis, Missouri 63108, USA.
⁴ Department of Computer Science and Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Providence, Rhode Island 02912, USA.

PMID: 25001846
PMCID: PMC4168012
DOI: 10.1038/nrg3767

Review

Li Ding et al. Nat Rev Genet. 2014 Aug;15(8):556-70. doi: 10.1038/nrg3767. Epub 2014 Jul 8.

Abstract

High-throughput DNA sequencing has revolutionized the study of cancer genomics with numerous discoveries that are relevant to cancer diagnosis and treatment. The latest sequencing and analysis methods have successfully identified somatic alterations, including single-nucleotide variants, insertions and deletions, copy-number aberrations, structural variants and gene fusions. Additional computational techniques have proved useful for defining the mutations, genes and molecular networks that drive diverse cancer phenotypes and that determine clonal architectures in tumour samples. Collectively, these tools have advanced the study of genomic, transcriptomic and epigenomic alterations in cancer, and their association with clinical properties. Here, we review cancer genomics software and the insights that have been gained from its application.

Figures

**Box 1 Figure. Data requirements for capturing heterozygous variants**
Identifying a single-nucleotide variant (SNV) requires its observation in multiple reads, usually at least three, but accrual of these reads is governed by the random dynamics of sampling and coverage, quantified in the ideal case (pure samples, perfect data and no sequence bias) by Eq. (1) for various tumour mass fractions. Data requirements are pushed appreciably higher by subclones that comprise smaller fractions of the entire tumour mass. The red triangle marks a redundancy of 340X, which gives a 99% probability of observing ≥3 reads in a 5% subclone.
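
As a rough illustration of the sampling argument in this caption, the sketch below computes the chance of seeing at least three variant reads under an idealized Poisson model. Eq. (1) is not reproduced on this page, so the model is an assumption rather than the authors' formula: a heterozygous variant in a diploid region is taken to appear in half of the reads drawn from the subclone, and the function name and its parameters are hypothetical.

```python
import math


def prob_at_least_k_variant_reads(depth, subclone_fraction, k=3):
    """P(at least k variant-supporting reads) under an idealized Poisson model.

    Assumptions (for illustration only): the variant is heterozygous in a
    diploid region, so its expected allele fraction is half the subclone's
    share of the sample, and variant read counts are Poisson distributed
    with mean depth * subclone_fraction / 2.
    """
    mean_variant_reads = depth * subclone_fraction / 2.0
    # Probability of observing fewer than k variant reads.
    p_fewer = sum(
        math.exp(-mean_variant_reads) * mean_variant_reads ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# 340X coverage, subclone making up 5% of the tumour mass.
p = prob_at_least_k_variant_reads(depth=340, subclone_fraction=0.05)
print(f"{p:.3f}")
```

Under these assumptions, 340X coverage of a 5% subclone gives a probability of about 0.99, in line with the red triangle in the figure. Real data would need more depth because of impurity, sequencing error and coverage bias.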

**Box 2 Figure. Environmental factors contributing to cancer risk**
Smoking, viruses, and radiation can strongly affect mutation rates across the cancer genome and mutation profiles across cancer types and human populations. Signatures of these effects can often be detected in tumour genome sequences.

**Figure 1. Sample procurement, sequencing, and analysis roadmap**
(A) Sequencing strategy: Most cancer genomics investigations sequence the genome of a tumour sample from a primary or metastatic lesion, starting with a non-specific ‘global’ sample pooled from biopsy or resection. Because the spatial distribution of any resident subclones is not known *a priori*, it will become increasingly common to sequence specific regions from a tumour section separately. In the limit, single-cell sequencing can also be performed on flow-sorted nuclei to assess cellular diversity. (B) Overview of the sequencing and analysis process: Tumour and adjacent healthy tissue samples are sequenced using high-throughput instruments to obtain genome, exome, RNA and other types of data. After alignment, a battery of detection tools identifies both small (SNV, indel) and large (copy number, structural variation, gene fusion) alterations, which are then annotated and analyzed individually (Level I; for example, for likely functional implications) and collectively (Level II; for example, to identify relevant gene pathways and networks).

**Figure 2. Biological factors relevant to assessing significant genes in cancer**
Genomic analysis establishes mutation frequencies of genes and helps characterize background mutation rates. Specific mutation hot spots have been found in various cancer types. Other factors have also been shown to affect the background mutation rate of a gene, including gene length, expression level and replication timing. State-of-the-art tools, such as MuSiC and MutSig, give proper consideration to these and many other factors, for example transition versus transversion frequency, in determining the significantly mutated genes that contribute substantively to cancer initiation and progression.

**Figure 3. Significantly mutated genes, pathways and networks**
Given the mutational status of genes across multiple patients, one can distinguish driver from passenger mutations using several strategies. Single-gene tests determine whether the observed number of samples having a mutation in the gene is significantly greater than what is expected under an appropriate null model. Pathway or gene set approaches examine whether multiple genes in pre-defined sets, as obtained for example from a curated database like KEGG, GO or MSigDB, have more mutations than expected. These tests are biased towards the prior knowledge of gene cascades residing in these databases, but the number of tests is relatively small, so the risks associated with Type I error tend to be manageable. Conversely, network approaches rely only on known protein-protein or protein-DNA interactions in examining combinations of mutations on whole-genome interaction networks, for example using the analog of heat diffusion. Because these approaches are unbiased, they furnish the possibility of inferring novel combinations of genes relevant to cancer, but larger numbers of hypothesis tests imply that greater care must be taken for multiple testing correction.
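
The single-gene test and the multiple testing correction mentioned in this caption can be sketched as follows. This is a deliberately simplified stand-in, not the procedure used by MuSiC or MutSig: it assumes a single uniform per-base background mutation rate, a binomial null model over samples, and Benjamini-Hochberg adjustment. The function names, gene labels, counts and the rate of 1e-6 are all made up for illustration.

```python
import math


def gene_pvalue(n_samples, n_mutated, gene_length, background_rate):
    """Upper-tail binomial p-value for one gene under a simple null model."""
    # Chance that one sample carries at least one background mutation
    # somewhere in the gene (assumes independent, uniform per-base rates).
    p0 = 1.0 - (1.0 - background_rate) ** gene_length
    # P(X >= n_mutated) when X ~ Binomial(n_samples, p0).
    return sum(
        math.comb(n_samples, i) * p0 ** i * (1.0 - p0) ** (n_samples - i)
        for i in range(n_mutated, n_samples + 1)
    )


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical cohort: gene -> (samples, mutated samples, coding length in bp).
toy_genes = {
    "GENE_A": (200, 12, 1500),
    "GENE_B": (200, 1, 3000),
    "GENE_C": (200, 0, 900),
}
names = list(toy_genes)
raw = [gene_pvalue(*toy_genes[g], background_rate=1e-6) for g in names]
adj = benjamini_hochberg(raw)
for g, p_raw, p_adj in zip(names, raw, adj):
    print(f"{g}: raw p = {p_raw:.3g}, adjusted p = {p_adj:.3g}")
```

A realistic null model would let the background rate vary with the factors listed for Figure 2, such as expression level and replication timing, and the correction matters more as the number of genes or gene combinations tested grows.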

**Figure 4. Conceptual example of clonal evolution model and clonality analysis**
(A) The founding clone (yellow) persists during the course of the disease. Another clone (green) present at time point 1 faces extinction before time point 2, but new subclones (blue at time point 2 and orange at time point 3) emerge during disease progression. (B) The SciClone algorithm detects the three mutation clusters present at time point 3.
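
To give a feel for what "detecting mutation clusters" means in panel B, the sketch below groups variant allele fractions (VAFs) with a plain one-dimensional k-means. This is not the SciClone algorithm, which relies on more elaborate statistical mixture modelling; k-means is swapped in only because it is short. The VAF values, the function name and the choice of three clusters are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster 1-D values with Lloyd's k-means; returns sorted cluster centres."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centre.
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Recompute centres; an empty group keeps its previous centre.
        new_centres = [
            sum(g) / len(g) if g else centres[j] for j, g in enumerate(groups)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return sorted(centres)


# Made-up VAFs for somatic mutations at a single time point.
toy_vafs = [0.48, 0.50, 0.52, 0.47, 0.30, 0.31, 0.29, 0.28, 0.10, 0.12, 0.11, 0.09]
cluster_centres = kmeans_1d(toy_vafs, k=3)
print([round(c, 3) for c in cluster_centres])
```

In this toy input, the highest-VAF cluster would correspond to mutations shared by all tumour cells (the founding clone), while lower-VAF clusters would correspond to subclones. A practical tool also has to choose the number of clusters and account for copy number and read-count noise, which this sketch ignores.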
