Bioinformatics. 2014 Oct;30(19):2787-95.
doi: 10.1093/bioinformatics/btu345. Epub 2014 Jun 3.

SMaSH: a benchmarking toolkit for human genome variant calling

Ameet Talwalkar et al. Bioinformatics. 2014 Oct.

Abstract

Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers.

Results: We propose SMaSH, a benchmarking methodology for evaluating germline variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on these benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single-nucleotide polymorphism, indel and structural variant calling algorithms.

Availability and implementation: We provide free, open online access to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.

Figures

Fig. 1.
An ‘ideal’ benchmarking dataset satisfies three properties: it contains real reads (R), it includes comprehensive validation of the underlying genome (C) and its underlying genome is human (H). SMaSH contains three types of benchmarking datasets, each of which satisfies two of the three desirable properties, so that together they cover all three.
Fig. 2.
Schematic illustrating the process by which SMaSH’s first ‘Mouse’ dataset is generated. (a) Our ideal setup, in which the B6 strain (with comprehensive validation and corresponding short reads) serves as the sample and the DBA strain serves as the reference. (b) Publicly available data (note that the B6 validation data are the canonical mouse reference). (c) Construction of an approximate DBA validation set (the ‘fake’ reference) by leveraging a rough set of variants for the DBA strain called relative to the canonical reference.
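
The idea in panel (c) of Fig. 2 can be illustrated with a toy sketch. This is not the SMaSH implementation: the function name `apply_snps`, the sequences and the variant list are all hypothetical, and only single-base substitutions are handled (real variant sets also contain indels and structural variants). It shows how a rough set of variants called against a canonical reference could be applied to that reference to obtain an approximate ‘fake’ reference.

```python
def apply_snps(reference: str, snps: dict[int, tuple[str, str]]) -> str:
    """Return a copy of `reference` with each SNP's alternate base substituted.

    `snps` maps a 0-based position to (expected_reference_base, alternate_base).
    Hypothetical helper for illustration only; indels are not supported.
    """
    bases = list(reference)
    for pos, (ref_base, alt_base) in snps.items():
        # Guard against variants that were called against a different reference.
        if bases[pos] != ref_base:
            raise ValueError(
                f"position {pos}: expected {ref_base!r}, found {bases[pos]!r}"
            )
        bases[pos] = alt_base
    return "".join(bases)


# Toy data (assumed, not from the paper): a short 'canonical' sequence and a
# rough set of substitutions standing in for the DBA variant calls.
canonical = "ACGTACGT"
rough_dba_snps = {1: ("C", "T"), 6: ("G", "A")}

fake_reference = apply_snps(canonical, rough_dba_snps)
print(fake_reference)  # ATGTACAT
```

Because the input variant set is only a rough one, the resulting sequence is an approximation of the true strain genome, which matches the caption's description of the ‘fake’ reference as approximate.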

