Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data

Jiarui Ding¹, Ali Bashashati, Andrew Roth, Arusha Oloumi, Kane Tse, Thomas Zeng, Gholamreza Haffari, Martin Hirst, Marco A Marra, Anne Condon, Samuel Aparicio, Sohrab P Shah

PMID: 22084253
PMCID: PMC3259434
DOI: 10.1093/bioinformatics/btr629

Bioinformatics. 2012 Jan 15;28(2):167-75. doi: 10.1093/bioinformatics/btr629. Epub 2011 Nov 13.

Affiliation

¹ Department of Molecular Oncology, BC Cancer Agency, Vancouver, BC, Canada.

Abstract

Motivation: The study of cancer genomes now routinely involves using next-generation sequencing technology (NGS) to profile tumours for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge.

Results: We present a comparison of four standard supervised machine learning algorithms for the purpose of somatic SNV prediction in tumour/normal NGS experiments. To evaluate these approaches (random forest, Bayesian additive regression tree, support vector machine and logistic regression), we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on these data (consisting of 1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigorous evaluation of these methods using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved the predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth 'false positive' predictions, we noted several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts illuminating important directions for future study.
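As an illustration of the cross-validation framework mentioned above, the following is a minimal sketch of how the 3369 labelled candidates could be partitioned into folds. The abstract does not state the number of folds or whether the splits were stratified; `k=10`, the stratification, and the function name `stratified_kfold` are assumptions made for this example, not the authors' protocol. Only the class counts (1015 somatic, 2354 non-somatic) come from the text.

```python
import random

def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class proportions
    roughly equal across folds. k and seed are illustrative choices."""
    rng = random.Random(seed)
    # Group example indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Shuffle within each class, then deal indices round-robin into folds.
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    # Each fold serves once as the test set; the rest form the training set.
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Class counts taken from the abstract: 1 = somatic, 0 = non-somatic.
labels = [1] * 1015 + [0] * 2354
splits = list(stratified_kfold(labels, k=10))
```

Each candidate appears in exactly one test fold, so every prediction used for evaluation comes from a classifier that did not see that candidate during training.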

Availability: Software called MutationSeq and datasets are available from http://compbio.bccrc.ca.

Figures

**Fig. 1.**
(a) Accuracy results from cross-validation experiments on all the exome capture data (SeqVal1+2). All classifiers showed better results than Samtools and GATK's prediction results in terms of ROC comparison. The numbers in parentheses are the prediction accuracy by fixing the sensitivity at 0.99, except for Samtools and GATK's prediction results because their outputs are deterministic. (b) Accuracy results from cross-validation experiments on the exome capture data of SeqVal1. (c) Accuracy results from cross-validation experiments on the exome capture data of SeqVal1 after GATK's local realignment around indels and base quality recalibration. (d) Accuracy results from cross-validation experiments on the exome capture data of SeqVal2. (e) Comparison of classifiers and Samtools's (ST) performance at the specificity and sensitivity level given by Samtools. (f) Comparison of classifiers and GATK's performance at the specificity and sensitivity level given by GATK.
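The "accuracy at a fixed sensitivity" reported in parentheses can be read as follows: choose the score threshold at which the classifier recovers the target fraction of true somatic mutations, then measure overall accuracy at that threshold. The sketch below is one plausible way to compute it; the function names and the toy scores are hypothetical and are not taken from MutationSeq.

```python
import math

def threshold_at_sensitivity(scores, labels, target=0.99):
    """Largest threshold t such that calling `score >= t` somatic
    recovers at least `target` of the true positives."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("need at least one positive example")
    needed = math.ceil(target * len(pos))  # positives that must be recovered
    return pos[needed - 1]

def accuracy_at_threshold(scores, labels, t):
    """Fraction of candidates classified correctly when score >= t means somatic."""
    correct = sum((s >= t) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)

# Toy data (hypothetical): classifier scores and validated labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

t = threshold_at_sensitivity(scores, labels, target=0.99)
acc = accuracy_at_threshold(scores, labels, t)
```

A deterministic caller outputs calls rather than scores, so it has no threshold to tune, which is why no such number is given for Samtools and GATK.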

**Fig. 2.**
ROC curves derived from the held-out whole genome shotgun independent test data from four cases show different classifiers' prediction results as well as Samtools and GATK's prediction results. The numbers in the parentheses are the prediction accuracy by using the same threshold as for the exome capture data (except for Samtools and GATK's prediction results).
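To make the ROC comparison concrete, here is a small sketch of how a scoring classifier traces a full curve while a deterministic caller contributes a single (false positive rate, true positive rate) point. The helper names and toy data are assumptions for illustration only, and the area under the curve is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low.
    Tied scores are grouped into a single point."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (s, y) in enumerate(ranked):
        tp += y
        fp += 1 - y
        # Emit a point only after the last item of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != s:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Trapezoidal area under a list of ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def single_point(calls, labels):
    """(FPR, TPR) for a deterministic caller that outputs 0/1 calls."""
    tp = sum(1 for c, y in zip(calls, labels) if c == 1 and y == 1)
    fp = sum(1 for c, y in zip(calls, labels) if c == 1 and y == 0)
    n_pos = sum(labels)
    return fp / (len(labels) - n_pos), tp / n_pos

# Toy data (hypothetical): 1 = validated somatic, 0 = non-somatic.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
calls = [1, 1, 1, 0, 1, 0, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)
caller_point = single_point(calls, labels)
```

Plotting `caller_point` against `curve` shows whether the classifier achieves higher sensitivity at the caller's false positive rate, which is the kind of comparison the figure describes.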

Grants and funding

202452/Canadian Institutes of Health Research/Canada
