Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Feb 22;23(1):155.
doi: 10.1186/s12864-022-08365-3.

Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery

Affiliations

Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery

Yury A Barbitoff et al. BMC Genomics. .

Abstract

Background: Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Although variant caller benchmarks are published constantly, no previous publications have leveraged the full extent of available gold standard whole-genome (WGS) and whole-exome (WES) sequencing datasets.

Results: In this work, we systematically evaluated the performance of 4 popular short read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Clair3, DeepVariant, Octopus, GATK, FreeBayes, and Strelka2) using a set of 14 "gold standard" WES and WGS datasets available from Genome In A Bottle (GIAB) consortium. Additionally, we have indirectly evaluated each pipeline's performance using a set of 6 non-GIAB samples of African and Russian ethnicity. In our benchmark, Bowtie2 performed significantly worse than other aligners, suggesting it should not be used for medical variant calling. When other aligners were considered, the accuracy of variant discovery mostly depended on the variant caller and not the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness. Other actively developed tools, such as Clair3, Octopus, and Strelka2, also performed well, although their efficiency had greater dependence on the quality and type of the input data. We have also compared the consistency of variant calls in GIAB and non-GIAB samples. With few important caveats, best-performing tools have shown little evidence of overfitting.

Conclusions: The results show surprisingly large differences in the performance of cutting-edge tools even in high confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. We also discuss the need for a more diverse set of gold standard genomes that would include samples of African, Hispanic, or mixed ancestry. Additionally, there is also a need for better variant caller assessment in the repetitive regions of the coding genome.

Keywords: Benchmark; Genome in a Bottle; Performance comparison; Pipeline; Variant calling; Whole exome sequencing; Whole genome sequencing.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Systematic benchmarking of multiple variant calling pipelines. a A chart representing the analysis workflow. b A scatterplot showing mean coverage of high-confidence coding sequence regions (defined by the Genome In A Bottle consortium) and the fraction of bases of such regions covered at least 10x total read depth in WGS and WES datasets used (each point corresponds to an individual sample). c Reciprocal overlap of GENCODE v19 CDS intervals, GIAB v. 4.2 and GIAB v. 3.3 high confidence regions common for all 7 samples, and pathogenic/likely pathogenic variants without conflicting reports from ClinVar (ClinVar v. 20,211,130 was used)
Fig. 2
Fig. 2
Statistical comparison of variant discovery pipelines’ performance. A Box plots representing the F1 scores for different combinations of aligners and variant callers. B - E Pairwise comparison of tool performance for read aligners (B, C) and variant callers (D, E) using pass-filter SNP (B, D) and indel (C, E) calls. On (B-E) the color of the cell corresponds to the median difference in F1 scores between the first tool (on the OX axis) and the second tool (on the OY axis); n.s. - the difference is not significant, * - p < 0.05, ** - p < 0.01, *** - p < 0.001 in the Wilcoxon paired signed rank test. Read aligners: BW - BWA MEM, BT-E2E - Bowtie2 (end-to-end mode), IS - isaac4, NO - Novoalign; variant callers and filtering strategies: CL - Clair3, DV - DeepVariant, G1 - GATK HaplotypeCaller with 1D CNN filtering, G2 - GATK HaplotypeCaller with 2D CNN filtering, GH - GATK HaplotypeCaller with recommended hard filters. ST - Strelka2, FB - Freebayes, OS - Octopus with standard filtering, OF - Octopus with random forest filtering
Fig. 3
Fig. 3
The necessity and possible benefit of variant filtering depends on the variant calling method. Shown are positive changes in precision (blue boxes) and negative changes in recall (yellow boxes) on SNPs and INDELs in WES and WGS data. Gray shading corresponds to cases where filtering is beneficial, i.e. gain in precision is greater than the loss of recall. Variant callers and filtering strategies: CL - Clair3, DV - DeepVariant, G1 - GATK HaplotypeCaller with 1D CNN filtering, G2 - GATK HaplotypeCaller with 2D CNN filtering, GH - GATK HaplotypeCaller with recommended hard filters. ST - Strelka2, FB - Freebayes, OS - Octopus with standard filtering, OF - Octopus with random forest filtering
Fig. 4
Fig. 4
Dissection of factors affecting variant discovery pipelines’ performance. a Boxplots representing the F1 scores for SNP and indel calling with different variant callers in WES and WGS datasets. b The relationship between the distance from CDS boundary and the median F1 score of variant calling in WES and WGS data for SNP (top) and indel (bottom) variants. Results shown were obtained using the best-performing aligner (Novoalign) and three best-performing variant calling methods (DeepVariant, Strelka2, GATK-HC with hard filtering). c Comparison of the variant performance of variant callers in regions with different levels of expected normalized coverage (left), GC-content (middle), and fraction of non-unique mappers (right). Histograms on top of each plot represent the distribution of each parameter (coverage, GC content, MF) across GRCh37 CDS regions. Results shown on (a-c) were obtained using the best-performing read aligner (Novoalign v. 4.02.01). For other aligners, please see Supplementary Figs. S5, S6 and S7. Variant callers and filtering strategies: CL - Clair3, DV - DeepVariant, G1 - GATK HaplotypeCaller with 1D CNN filtering, G2 - GATK HaplotypeCaller with 2D CNN filtering, GH - GATK HaplotypeCaller with recommended hard filters. ST - Strelka2, OS and OF - Octopus with standard filtering and random forest filtering, respectively, FB - FreeBayes
Fig. 5
Fig. 5
Variant calling and filtering methods perform similarly on GIAB and non-GIAB datasets. (a, c) A scatterplot showing the results of the principal component analysis in space of discordant variant calls on GIAB (a) and non-GIAB (c) data. b Distributions of the number of variant calling methods that detected each false negative (top) or false positive (bottom) variant. Note that the majority of FN and FP variants are represented by unique non-calls and unique calls, respectively. d Boxplots showing the number of unique calls (top) and unique non-calls (bottom) on GIAB and non-GIAB datasets for the indicated variant calling and filtering methods Variant callers and filtering strategies: CL - Clair3, DV - DeepVariant, G1 - GATK HaplotypeCaller with 1D CNN filtering, G2 - GATK HaplotypeCaller with 2D CNN filtering, GH - GATK HaplotypeCaller with recommended hard filters. ST - Strelka2, FB - Freebayes, OS - Octopus with standard filtering, OF - Octopus with random forest filtering

References

    1. van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends Genet. 2014;30:418–426. doi: 10.1016/j.tig.2014.07.001. - DOI - PubMed
    1. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. - DOI - PMC - PubMed
    1. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, Cortes A, Welsh S, et al. The UK biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. - DOI - PMC - PubMed
    1. Barbitoff YA, Bezdvornykh IV, Polev DE, Serebryakova EA, Glotov AS, Glotov OS, Predeus AV. Catching hidden variation: systematic correction of reference minor allele annotation in clinical variant calling. Genet Med. 2018;20:360–364. doi: 10.1038/gim.2017.168. - DOI - PubMed
    1. van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, et al. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinforma. 2013;10:1–10.33. - PMC - PubMed

LinkOut - more resources