Bioinformatics. 2010 Mar 15;26(6):730-6.

doi: 10.1093/bioinformatics/btq040. Epub 2010 Feb 3.

SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors

Rodrigo Goya¹, Mark G F Sun, Ryan D Morin, Gillian Leung, Gavin Ha, Kimberley C Wiegand, Janine Senz, Anamaria Crisan, Marco A Marra, Martin Hirst, David Huntsman, Kevin P Murphy, Sam Aparicio, Sohrab P Shah

PMID: 20130035
PMCID: PMC2832826
DOI: 10.1093/bioinformatics/btq040

Rodrigo Goya et al. Bioinformatics. 2010.

Affiliation

¹ Department of Molecular Oncology Breast Cancer Research Program, British Columbia Cancer Research Centre, Vancouver, BC, Canada.

Abstract

Motivation: Next-generation sequencing (NGS) has enabled whole genome and transcriptome single nucleotide variant (SNV) discovery in cancer. NGS produces millions of short sequence reads that, once aligned to a reference genome sequence, can be interpreted for the presence of SNVs. Although tools exist for SNV discovery from NGS data, none are specifically suited to work with data from tumors, where altered ploidy and tumor cellularity impact the statistical expectations of SNV discovery.

Results: To address this problem, we developed three implementations of a probabilistic Binomial mixture model, called SNVMix, designed to infer SNVs from NGS data from tumors. The first models allelic counts as observations and infers SNVs and model parameters using an expectation-maximization (EM) algorithm; it can therefore adjust to the deviations in allelic frequencies inherent in genomically unstable tumor genomes. The second models nucleotide and mapping qualities of the reads by probabilistically weighting the contribution of each read/nucleotide to the inference of an SNV according to our confidence in the base call and the read alignment. The third combines filtering of low-quality data with probabilistic weighting of the qualities. We quantitatively evaluated these approaches on 16 ovarian cancer RNASeq datasets with matched genotyping arrays and on a human breast cancer genome sequenced to >40x (haploid) coverage with ground truth data, and show systematically that the SNVMix models outperform competing approaches.
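As a rough illustration of the first model, the sketch below fits a three-component Binomial mixture to allelic counts with EM. It is a minimal maximum-likelihood sketch, not the SNVMix implementation: the function names, the starting values and the toy counts are all assumptions made for illustration, and any priors or quality weighting used by the published software are omitted.

```python
import math

GENOTYPES = ("aa", "ab", "bb")  # homozygous reference, heterozygous, homozygous variant


def binom_pmf(a, n, mu):
    """Probability of a reference reads out of n, given reference-allele rate mu."""
    return math.comb(n, a) * mu**a * (1.0 - mu) ** (n - a)


def posteriors(counts, pi, mu):
    """E-step: P(G_i = k | a_i, N_i) for every position i."""
    out = []
    for a, n in counts:
        weights = [pi[k] * binom_pmf(a, n, mu[k]) for k in range(3)]
        total = sum(weights)
        out.append([w / total for w in weights])
    return out


def fit_em(counts, pi, mu, n_iter=50):
    """Alternate E- and M-steps; returns updated copies of pi and mu."""
    pi, mu = list(pi), list(mu)
    for _ in range(n_iter):
        resp = posteriors(counts, pi, mu)
        for k in range(3):
            weight = sum(r[k] for r in resp)
            ref = sum(r[k] * a for r, (a, _) in zip(resp, counts))
            depth = sum(r[k] * n for r, (_, n) in zip(resp, counts))
            pi[k] = weight / len(counts)
            if depth > 0:
                # Clamp away from 0 and 1 so every count keeps non-zero probability.
                mu[k] = min(max(ref / depth, 1e-6), 1.0 - 1e-6)
    return pi, mu


# Toy data (hypothetical): (reference reads, total reads) per position.
counts = [(20, 20), (19, 20), (10, 20), (11, 20), (0, 20), (1, 20), (20, 20), (9, 20)]
pi, mu = fit_em(counts, pi=[1 / 3, 1 / 3, 1 / 3], mu=[0.99, 0.5, 0.01])
calls = [GENOTYPES[max(range(3), key=lambda k: r[k])] for r in posteriors(counts, pi, mu)]
```

Because `mu` is re-estimated from the data, the heterozygous component is free to drift away from 0.5, which is the kind of adjustment to skewed allelic frequencies that the abstract describes.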

Availability: Software and data are available at http://compbio.bccrc.ca

Contact: sshah@bccrc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
(A) Schematic diagram of input data to SNVMix1, showing how allelic counts (bottom) are derived from aligned reads (top). The reference sequence is shown in blue. The arrows indicate positions representing SNVs. The non-reference bases are shown in red. (B) Input data for SNVMix2, which consists of the mapping and base qualities. A darker background for a read represents a higher quality alignment, and brighter colored nucleotides represent higher quality base calls. Therefore, high contrast nucleotides are more trustworthy than lower contrast nucleotides. (C) SNVMix1 shown as a probabilistic graphical model. Circles represent random variables, and rounded squares represent fixed constants. Shaded nodes indicate observed data [the allelic counts and the read depth from (A)]. Unshaded nodes indicate quantities that are inferred during EM. G_i∈{aa, ab, bb} represents the genotype, N_i∈{0, 1,…} is the number of reads and a_i∈{0, 1,…, N_i} is the number of reference reads. π is the prior over genotypes and μ_k is the genotype-specific Binomial parameter for genotype k. (D) SNVMix2 shown as a probabilistic graphical model. In comparison to SNVMix1, a_i is unobserved and we expand the input to consider read-specific information indexed by j, where zⁱ_j=1 indicates that read j is correctly aligned, qⁱ_j is the base quality and rⁱ_j is the mapping quality.
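Using only the symbols defined in the caption, the SNVMix1 model in panel (C) can be written out as below. This is a sketch of the standard mixture-model form, assuming a categorical prior over genotypes; any hyperpriors on π and μ_k are left out, and the posterior is the usual Bayes-rule expression rather than a formula quoted from the paper.

```latex
\begin{align*}
  G_i \mid \pi &\sim \mathrm{Categorical}(\pi),
    \qquad G_i \in \{aa, ab, bb\} \\
  a_i \mid G_i = k,\, N_i,\, \mu &\sim \mathrm{Binomial}(N_i, \mu_k) \\
  p(G_i = k \mid a_i, N_i, \pi, \mu)
    &= \frac{\pi_k \binom{N_i}{a_i} \mu_k^{a_i} (1 - \mu_k)^{N_i - a_i}}
            {\sum_{j \in \{aa, ab, bb\}} \pi_j \binom{N_i}{a_i} \mu_j^{a_i} (1 - \mu_j)^{N_i - a_i}}
\end{align*}
```

The binomial coefficient cancels between numerator and denominator, so only π and μ affect the posterior for given a_i and N_i.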

**Fig. 2.**
(A) Theoretical behavior of SNVMix at depths of 2, 3, 5, 10, 15, 20, 35, 50 and 100. The plots show how the distribution of marginal probabilities changes with the number of reference alleles given the model parameters fit to a 10× breast cancer genome dataset. (B) ROC curves from fitting SNVMix2 to synthetic data with increasing levels of certainty in the base call.
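The kind of curve described in panel (A) can be approximated with a few lines of Python. The sketch below computes, for a fixed depth, the probability of a non-reference genotype (ab or bb) as the number of reference reads varies. The values of `PI` and `MU` are placeholders chosen for illustration, not the parameters fit to the breast cancer dataset, and `snv_probability_curve` is a hypothetical helper name.

```python
import math

# Placeholder parameters (assumptions, not fitted values from the paper).
PI = (0.98, 0.01, 0.01)   # prior over genotypes aa, ab, bb
MU = (0.99, 0.50, 0.01)   # expected reference-allele fraction per genotype


def snv_probability_curve(depth, pi=PI, mu=MU):
    """Return P(G in {ab, bb} | a, depth) for a = 0 .. depth reference reads."""
    curve = []
    for a in range(depth + 1):
        weights = [
            pi[k] * math.comb(depth, a) * mu[k] ** a * (1.0 - mu[k]) ** (depth - a)
            for k in range(3)
        ]
        curve.append((weights[1] + weights[2]) / sum(weights))
    return curve


# One curve per depth listed in the caption.
curves = {d: snv_probability_curve(d) for d in (2, 3, 5, 10, 15, 20, 35, 50, 100)}
```

Under these placeholder parameters, a position with no reference reads receives a high SNV probability even at depth 2, while a position covered only by reference reads receives a low one.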

**Fig. 3.**
Conditional probability distributions of SNVMix model.

**Fig. 4.**
Distribution of AUC over 16 ovarian cancer transcriptomes comparing accuracy of SNV detection for two Maq runs, the best and worst SNVMix1 runs in the cross-validation experiment (middle) and best and worst runs for SNVMix2 (mbQ0 = no quality thresholding, MbQ30 = keeping only reads with mapping qualities > Q30). SNVMix1 and SNVMix2 runs were statistically more accurate than both Maq runs (ANOVA, P < 0.0001). SNVMix2 runs were better than SNVMix1, but not statistically significantly.

Grants and funding

Canadian Institutes of Health Research/Canada
