BMC Bioinformatics. 2014 Apr 12;15:104.
doi: 10.1186/1471-2105-15-104.

BAYSIC: a Bayesian method for combining sets of genome variants with improved specificity and sensitivity

Brandi L Cantarel et al. BMC Bioinformatics. 2014.

Abstract

Background: Accurate genomic variant detection is an essential step in gleaning medically useful information from genome data. However, low concordance among variant-calling methods reduces confidence in the clinical validity of whole genome and exome sequence data, and confounds downstream analysis for applications in genome medicine. Here we describe BAYSIC (BAYeSian Integrated Caller), which combines SNP variant calls produced by different methods (e.g. GATK, FreeBayes, Atlas, SamTools, etc.) into a more accurate set of variant calls. BAYSIC differs from majority voting, consensus or other ad hoc intersection-based schemes for combining sets of genome variant calls. Unlike other classification methods, the underlying BAYSIC model does not require training using a "gold standard" of true positives. Rather, with each new dataset, BAYSIC performs an unsupervised, fully Bayesian latent class analysis to estimate false positive and false negative error rates for each input method. The user specifies a posterior probability threshold according to the user's tolerance for false positive and false negative errors; lowering the posterior probability threshold allows the user to trade specificity for sensitivity while raising the threshold increases specificity in exchange for sensitivity.

Results: We assessed the performance of BAYSIC in comparison to other variant detection methods using ten low coverage (~5X) samples from The 1000 Genomes Project, a tumor/normal exome pair (40X), and exome sequences (40X) from positive control samples previously identified to contain clinically relevant SNPs. We demonstrated BAYSIC's superior variant-calling accuracy, both for somatic mutation detection and germline variant detection.

Conclusions: BAYSIC provides a method for combining sets of SNP variant calls produced by different variant calling programs. The integrated set of SNP variant calls produced by BAYSIC improves the sensitivity and specificity of the variant calls used as input. In addition to combining sets of germline variants, BAYSIC can also be used to combine sets of somatic mutations detected in the context of tumor/normal sequencing experiments.

Figures

Figure 1
Contingency table and posterior probabilities for SNP variant detection programs. Variants were detected jointly on ten samples from The 1000 Genomes Project using FreeBayes, SamTools, GATK, and Atlas as described in Methods. For each possible combination of agreement amongst the variant calling programs and dbSNP, the observed number of SNP variant positions and the posterior probability calculated by BAYSIC are shown.
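The contingency table in this caption can be illustrated with a minimal sketch. It assumes each call set has already been reduced to a Python set of site keys such as (chromosome, position); the function name `agreement_table` and the toy inputs are hypothetical and are not taken from the BAYSIC software.

```python
from collections import Counter


def agreement_table(call_sets):
    """Count SNP sites for every observed pattern of agreement.

    call_sets maps a source name (a caller or a database such as dbSNP)
    to the set of sites it reports. Each site is assigned a 0/1 tuple,
    one entry per source in alphabetical order, and the tuples are tallied.
    """
    names = sorted(call_sets)
    all_sites = set().union(*call_sets.values())
    table = Counter()
    for site in all_sites:
        pattern = tuple(int(site in call_sets[name]) for name in names)
        table[pattern] += 1
    return names, table


# Toy input (hypothetical sites, not data from the paper).
names, table = agreement_table({
    "A": {("1", 100), ("1", 200)},
    "B": {("1", 200), ("1", 300)},
})
```

Sites reported by no source never appear in the union, so the all-zero pattern is absent from the table.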
Figure 2
Overview of BAYSIC algorithm. Sets of variant calls produced from one or more programs are input in VCF format. Optionally, variants from third party databases are included as additional sources of information, e.g. dbSNP for normal variant calling and COSMIC for somatic mutation calling. False positive and false negative error rates are estimated using Markov Chain Monte Carlo simulation, and a posterior probability is calculated for each possible combination of agreement between the variant calling programs (see Methods). Finally, variants whose posterior probability is greater than the cutoff specified by the user (default value = 0.8) are output to generate a set of integrated variant calls.
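The last two steps of this overview, scoring each agreement pattern and applying the cutoff, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes the error rates as fixed numbers (BAYSIC estimates them by Markov Chain Monte Carlo), and it assumes the common two-class latent class form in which callers err independently given the true state. The names `pattern_posterior`, `integrate_patterns`, and `prevalence` are hypothetical.

```python
from itertools import product


def pattern_posterior(pattern, fp, fn, prevalence):
    """Posterior probability that a site is a true variant.

    pattern    -- tuple of 0/1 flags, one per caller (1 = variant called)
    fp, fn     -- per-caller false positive and false negative rates
    prevalence -- assumed prior probability that a candidate site is real
    """
    like_true = prevalence
    like_false = 1.0 - prevalence
    for called, fp_i, fn_i in zip(pattern, fp, fn):
        if called:
            like_true *= 1.0 - fn_i   # caller detects a real variant
            like_false *= fp_i        # caller reports a spurious one
        else:
            like_true *= fn_i         # caller misses a real variant
            like_false *= 1.0 - fp_i  # caller correctly stays silent
    return like_true / (like_true + like_false)


def integrate_patterns(fp, fn, prevalence, cutoff=0.8):
    """Map each non-empty agreement pattern to (posterior, kept?)."""
    result = {}
    for pattern in product((0, 1), repeat=len(fp)):
        if not any(pattern):
            continue  # no caller reported the site
        post = pattern_posterior(pattern, fp, fn, prevalence)
        result[pattern] = (post, post > cutoff)
    return result


# Illustrative rates only; they are not estimates from the paper.
result = integrate_patterns(fp=[0.1, 0.1], fn=[0.2, 0.2], prevalence=0.5)
```

With these made-up rates, a site called by both programs clears the default cutoff of 0.8, while a site called by only one does not. Lowering `cutoff` would admit the single-caller patterns, which is the specificity-for-sensitivity trade described in the abstract.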
Figure 3
Agreement amongst variant calling programs. Variants were detected jointly on ten samples from The 1000 Genomes Project using FreeBayes, SamTools, GATK, and Atlas, as in Figure 1. A. For each SNP, agreement amongst the variant calling programs was calculated. The number of SNPs detected by each of the programs is indicated by the number in the enclosing ellipses. B. Agreement amongst the variant calling programs displayed as a barplot. A = Atlas, F = FreeBayes, G = GATK, S = Samtools. Left-hand y-axis indicates number of SNPs detected by the programs denoted on the x-axis. Right-hand y-axis indicates the percent of all SNPs detected by the programs denoted on the x-axis.
Figure 4
Sensitivity and specificity of BAYSIC and other variant calling programs. A. Improvement of sensitivity and specificity of BAYSIC compared with input variant calling programs used with default parameters. SNP variants were detected jointly on ten samples from The 1000 Genomes Project using FreeBayes, SamTools, GATK (low quality filtered) and Atlas as in Figure 1, and the four variant sets and dbSNP were combined using BAYSIC. Sensitivity of each of the variant calling programs and BAYSIC was measured as the percent of SNPs confirmed by an orthogonal platform (SNP-chip) that were detected by the given program. Specificity was measured as the transition/transversion ratio (Ti/Tv) of all SNP variants called by each program. The sensitivity and specificity for SNPs in coding (top) and non-coding regions (bottom) are shown. Numbers accompanying black symbols indicate the posterior probability cutoff used for generating the BAYSIC integrated variant sets. Horizontal dashed line indicates the specificity of the intersection of the four sets of variant predictions produced by FreeBayes, SamTools, GATK and Atlas. Vertical dashed line indicates the sensitivity of the union of the four sets of variant predictions produced by FreeBayes, SamTools, GATK and Atlas. B. BAYSIC sensitivity and specificity compared with variant calling programs with continuous estimates of variant probability. Variants were detected using FreeBayes and GATK with varying stringency by applying cutoffs based on quality scores (for FreeBayes) or either Tranche scores or VQSLOD scores (for GATK). Sensitivity and specificity are shown for FreeBayes with cutoffs of Q10, Q20 (blue points) and GATK with Tranche cutoffs (open purple points, no cutoff, Tranche90, Tranche99 and Tranche99.9) or VQSLOD cutoffs (closed purple points, VQSLOD cutoffs of 0, 2.9, 4.4 or 5.4 from left to right). Sensitivity and specificity of BAYSIC using FreeBayes, Samtools, GATK and Atlas with default parameters as input are shown for comparison.
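The two measures defined in this caption can be computed as in the sketch below. It assumes SNPs are available as (ref, alt) base pairs and that called and chip-confirmed sites are sets of comparable keys; the function names and toy values are hypothetical.

```python
TRANSITIONS = (frozenset("AG"), frozenset("CT"))  # purine<->purine, pyrimidine<->pyrimidine


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    return frozenset((ref.upper(), alt.upper())) in TRANSITIONS


def ti_tv_ratio(snps):
    """Transition/transversion ratio of (ref, alt) pairs, the specificity proxy."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv


def sensitivity_percent(called_sites, confirmed_sites):
    """Percent of orthogonally confirmed sites recovered by a call set."""
    if not confirmed_sites:
        raise ValueError("no confirmed sites to compare against")
    return 100.0 * len(called_sites & confirmed_sites) / len(confirmed_sites)


# Toy values, not data from the paper.
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")])
sens = sensitivity_percent({1, 2, 3}, {2, 3, 4, 5})
```

Ti/Tv is an indirect measure: it summarizes a whole call set rather than labeling individual calls as right or wrong.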
Figure 5
Effect of variant calling programs used as input on sensitivity and specificity of BAYSIC. SNP variants were detected with BAYSIC using as input all possible combinations of FreeBayes, SamTools, GATK and Atlas. The sensitivity and specificity of each set were then measured as in Figure 4 for SNPs occurring in coding regions (top) or non-coding regions (bottom). The sensitivity and specificity for each combination of input call sets and for a range of posterior probability cutoffs are shown.
Figure 6
Combining somatic mutation calls from tumor/normal pair samples using BAYSIC. Somatic mutations from exome data from a single patient were predicted using Mutect, Strelka, Varscan2 and Shimmer and these four sets of somatic mutation calls were combined using BAYSIC with a posterior probability cutoff of 0.8. As a measure of sensitivity, the number of somatic mutations predicted by each caller that are present in COSMIC, a database of somatic mutation calls from other samples, is shown (top). As a measure of specificity, the percent of each set of somatic mutation calls that are present in COSMIC is shown (bottom). A horizontal dashed line indicating the percent of BAYSIC somatic mutational calls present in COSMIC is shown.
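The two COSMIC-based measures in this caption reduce to a set overlap. The sketch below assumes both the call set and the database have been reduced to sets of comparable mutation keys; `cosmic_overlap` and the toy inputs are hypothetical.

```python
def cosmic_overlap(calls, cosmic):
    """Return (number of calls found in COSMIC, percent of calls found in COSMIC).

    The count serves as the sensitivity proxy (top panel) and the
    percent as the specificity proxy (bottom panel).
    """
    hits = calls & cosmic
    percent = 100.0 * len(hits) / len(calls) if calls else 0.0
    return len(hits), percent


# Toy keys standing in for mutation identifiers; not data from the paper.
count, percent = cosmic_overlap({1, 2, 3, 4}, {2, 4, 9})
```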
