Genome Biol. 2015 Sep 17;16(1):197.

doi: 10.1186/s13059-015-0758-2.

An ensemble approach to accurately detect somatic mutations using SomaticSeq

Li Tai Fang¹, Pegah Tootoonchi Afshar², Aparna Chhibber³, Marghoob Mohiyuddin⁴, Yu Fan⁵, John C Mu⁶, Greg Gibeling⁷, Sharon Barr⁸, Narges Bani Asadi⁹, Mark B Gerstein¹⁰, Daniel C Koboldt¹¹, Wenyi Wang¹², Wing H Wong¹³ ¹⁴, Hugo Y K Lam¹⁵

Affiliations

¹ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. li\_tai.fang@bina.roche.com.
² Department of Electrical Engineering, Stanford University, Stanford, 94305, CA, USA. pegahta@stanford.edu.
³ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. aparna.chhibber@bina.roche.com.
⁴ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. marghoob.mohiyuddin@bina.roche.com.
⁵ Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, 77030, TX, USA. YFan1@mdanderson.org.
⁶ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. john.mu@bina.roche.com.
⁷ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. greg.gibeling@bina.roche.com.
⁸ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. sharon.barr@bina.roche.com.
⁹ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. narges.baniasadi@bina.roche.com.
¹⁰ Program in Computational Biology and Bioinformatics, Yale University, New Haven, 06520, CT, USA. mark.gerstein@yale.edu.
¹¹ The Genome Institute, Washington University in St Louis, St Louis, 63108, MO, USA. dkoboldt@genome.wustl.edu.
¹² Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, 77030, TX, USA. wwang7@mdanderson.org.
¹³ Department of Statistics, Stanford University, Stanford, 94305, CA, USA. whwong@stanford.edu.
¹⁴ Department of Health Research and Policy, Stanford University, Stanford, 94305, CA, USA. whwong@stanford.edu.
¹⁵ Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. hugo.lam@bina.roche.com.

PMID: 26381235
PMCID: PMC4574535
DOI: 10.1186/s13059-015-0758-2

Li Tai Fang et al. Genome Biol. 2015.

Abstract

SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated.
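To illustrate what "adaptively boosted" means in the abstract, the following is a minimal sketch of discrete AdaBoost over one-feature threshold stumps. It is not SomaticSeq's implementation: the function names, the two toy features (number of callers reporting a site, mapping quality), and the data values are assumptions made up for illustration.

```python
import math

def stump(row, j, thr, pol):
    """One-feature decision stump: vote pol if row[j] >= thr, else -pol."""
    return pol if row[j] >= thr else -pol

def train_adaboost(X, y, rounds=3):
    """Discrete AdaBoost. X is a list of feature rows, y holds +1 / -1 labels."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                for pol in (1, -1):
                    err = sum(w for row, label, w in zip(X, y, weights)
                              if stump(row, j, thr, pol) != label)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Up-weight misclassified sites, down-weight correct ones, renormalize.
        weights = [w * math.exp(-alpha * label * stump(row, j, thr, pol))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, row):
    """Weighted vote of all stumps: +1 (somatic) or -1 (not somatic)."""
    score = sum(a * stump(row, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1

# Hypothetical candidate sites: [number of callers reporting, mapping quality]
X = [[5, 60], [4, 40], [3, 55], [1, 50], [2, 60], [0, 30]]
y = [1, 1, 1, -1, -1, -1]
model = train_adaboost(X, y)
predictions = [predict(model, row) for row in X]
```

Each round reweights the training sites so that later stumps concentrate on the sites earlier stumps got wrong; the final classifier is the weighted vote.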

Figures

**Fig. 1**
SomaticSeq workflow. The workflow starts with FASTQ files for both the tumor and the matched normal sequencing reads, which are processed using Genome Analysis Toolkit (GATK) best practices to create two BAM files. The five somatic SNV callers (and three indel callers) are run on the pair of BAM files to generate mutation calls. Their results are merged, and then up to 72 features for each of the combined calls are generated from the BAM files using SAMtools and GATK HaplotypeCaller, as well as outputs from the callers themselves. The ensemble along with the feature set is then provided to the machine-learning model, which is trained with either a separate data set or a portion of these data. After training, the model calculates the probability for each call, yielding a high-confidence somatic mutation call set
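The merge step in the workflow above can be sketched as a union of call sets keyed by site, with one flag per caller recording whether it reported that site. This is only an illustration under stated assumptions: the caller names `caller_a` and `caller_b`, the tuple layout, and the `n_callers` field are hypothetical and do not reflect SomaticSeq's actual data structures or feature names.

```python
from collections import defaultdict

def merge_calls(calls_by_caller):
    """Union the call sets and record which callers reported each site.

    calls_by_caller maps a caller name to an iterable of
    (chrom, pos, ref, alt) tuples.
    """
    callers = sorted(calls_by_caller)
    seen = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for site in calls:
            seen[site].add(caller)
    merged = {}
    for site in sorted(seen):
        # One 0/1 flag per caller, plus a simple agreement count.
        flags = {c: int(c in seen[site]) for c in callers}
        flags["n_callers"] = len(seen[site])
        merged[site] = flags
    return merged

# Hypothetical call sets from two callers
calls = {
    "caller_a": [("chr1", 100, "G", "A"), ("chr2", 50, "C", "T")],
    "caller_b": [("chr1", 100, "G", "A")],
}
merged = merge_calls(calls)
```

In the real workflow these per-caller indicators would sit alongside the sequencing features extracted from the BAM files.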

**Fig. 2**
DREAM Challenge Stage 3 results trained from modified Stage 2 data. a Histogram of probability values (P) of all the mutation candidates in Stage 3. Higher probability values (closer to 1) imply that calls are more likely true somatic mutations. b The same plot with the y-axis in log10 scale. The overlaps can be seen. Keep in mind each unit in the y-axis is a tenfold increase. c An accuracy plot showing sensitivity, precision, and F₁ scores vs. P cut-off
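The quantities plotted in panel c can be computed from a scored candidate list as follows. This is a minimal sketch: the function name, the input layout, and the toy probabilities are assumptions for illustration, and the example assumes every true mutation appears among the candidates.

```python
def accuracy_at_cutoff(calls, n_true_total, cutoff):
    """Return (sensitivity, precision, F1) for calls with P >= cutoff.

    calls is a list of (probability, is_true_somatic) pairs;
    n_true_total is the number of true somatic mutations in the ground truth.
    """
    kept = [is_true for p, is_true in calls if p >= cutoff]
    tp = sum(kept)            # true positives among retained calls
    fp = len(kept) - tp       # false positives among retained calls
    sensitivity = tp / n_true_total if n_true_total else 0.0
    precision = tp / (tp + fp) if kept else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1

# Hypothetical scored candidates: (P, whether the call is a true mutation)
calls = [(0.99, True), (0.95, True), (0.80, True),
         (0.75, False), (0.40, True), (0.10, False)]
curve = {c: accuracy_at_cutoff(calls, 4, c) for c in (0.0, 0.7, 0.9)}
```

Raising the cut-off trades sensitivity for precision; F₁ is their harmonic mean, so it peaks where the two are balanced.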

**Fig. 3**
F₁ scores of SomaticSeq and the individual tools for the DREAM Challenge Stage 3 cross-validation. On the x-axes, Setting A is the pure normal/pure tumor. Setting B is the contaminated normal/pure tumor. Setting C is the pure normal/contaminated tumor. Setting D is the contaminated normal/contaminated tumor. a SNV results. b Indel results

**Fig. 4**
In silico titration of two human genomes. Blue represents reads from NA12878 (designated normal). Red represents reads from NS12911 (designated tumor). Going from (a) to (b) represents a somatic mutation of G>A, where G in the normal is a homozygous reference and A in the tumor is a heterozygous variant. c A normal contaminated with tumor tissues. d A tumor sample contaminated with normal tissues

**Fig. 5**
Obtaining the ground truth for the in silico tumor–normal data. In the NA12878 and NS12911 mixture, there are a total of 746,280 virtual somatic SNVs and 64,399 virtual somatic indels. A total of 2.2 billion high-confidence sites are interrogated (the remaining are ignored). During our analyses, a somatic mutation rate of one out of a million was enforced to represent a realistic prior probability of somatic mutations

**Fig. 6**
F₁ scores of SomaticSeq and the individual tools in in silico titration. Color legends are shown in Fig. 3. On the x-axes, the subscript denotes the expected VAF as a percentage, i.e., N₀T₅₀ means the normal has VAF = 0 % (i.e., pure normal) and the tumor has VAF = 50 %. N₂.₅T₁₅ represents a challenging data set where VAF = 2.5 % for normal and VAF = 15 % for tumor, i.e., 5 % of the normal sample is contaminated with tumor tissues and 30 % of the tumor sample is contaminated with normal tissues. a SNV results. b Indel results

**Fig. 7**
F₁ scores vs. VAF for different coverage depths

**Fig. 8**
SomaticSeq performance on real data. a The sensitivity of SomaticSeq as a function of P cut-offs. b The call set size as a function of P cut-offs, normalized to the call set size at P=0.7, i.e., the ratio between the call set size at a given P and the call set size at P=0.7 (default cut-off in this study)

**Fig. 9**
CADD scores from SomaticSeq’s PASS (high confidence) calls vs. LowQual (medium confidence) vs. REJECT (likely false positive) calls. a COLO-829. b CLL1. Only non-synonymous SNVs were evaluated. The p-values were calculated from a two-sided Wilcoxon rank-sum test

**Fig. 10**
Machine learning: a training set with ground truth is provided to the machine-learning algorithm to create an adaptively boosted classifier. The classifier is applied to a target set to create a high-confidence somatic mutation call set. The call set is compared to the ground truth or the validated mutation list of the target set to calculate the accuracy (only sensitivity is calculated for real data)
