Bioinformatics. 2023 Dec 1;39(12):btad722.
doi: 10.1093/bioinformatics/btad722.

Benchmarking and improving the performance of variant-calling pipelines with RecallME

Gianluca Vozza et al. Bioinformatics.

Abstract

Motivation: The steady increase in whole-genome/exome sequencing and the development of novel next-generation sequencing-based gene panels require continuous testing and validation of variant-calling (VC) pipelines, together with the detection of sequencing-related issues, so that pipelines stay up to date and feasible for clinical settings. State-of-the-art tools are reliable for computing standard performance metrics. However, the need for automated software that discriminates between bioinformatic and sequencing issues and optimizes VC parameters remains unmet.

Results: This work presents RecallME, a bioinformatic suite that tracks down difficult-to-detect variants, such as insertions and deletions in highly repetitive regions. It thereby provides the maximum reachable recall for both single nucleotide variants and small insertions and deletions, and precisely guides the user through the pipeline optimization process.

Availability and implementation: Source code is freely available under the MIT license at https://github.com/mazzalab-ieo/recallme. The RecallME web application is available at https://translational-oncology-lab.shinyapps.io/recallme/. To use RecallME, users must obtain their own license for ANNOVAR.

Conflict of interest statement

V.F. and F.Z. are both employed at 4bases Italia s.r.l. G.V. is a former associate of 4bases Italia s.r.l. and currently employed at 4bases Italia s.r.l.

Figures

Figure 1.
RecallME allows the maximization of recall in reference samples. (A) Flowchart showing how the RecallME suite works with and without the BAM file as input. First, bcftools norm splits the multi-allelic variants; then the query and ground-truth VCF files are converted into ANNOVAR inputs to harmonize variant notations. The bedtools genomecov function computes the number of base pairs that are not considered high confidence by subtracting base pairs outside the BED file (to compute the number of true negatives and, consequently, the specificity). Bam-readcount looks for SNVs and INDELs within the BAM file to check whether the recall can be maximized. The recaller.R and pileup_recaller.R scripts compute standard metrics such as recall, precision, specificity, F1-score, and FDR, as well as the Recall Max. (B) Barplots showing recall metrics in the TVC and LoFreq pipelines (ION technology) and GATK-HC (Illumina-based), computed by som.py and RecallME (before and after the BAM file re-check step), for SNV and INDEL calls in the NA12878 sample. (C) Barplots showing recall metrics in the GATK-HC and Pepper DV pipelines (Illumina and ONT-based), computed by som.py and RecallME (before and after the BAM file re-check step), for SNV and INDEL calls in the HD793 sample (Illumina and ONT). (D) Barplots showing accuracy metrics in the Mutect2 pipeline, computed by som.py and RecallME (before and after the BAM file re-check step, i.e. RecallMax), for SNV and INDEL calls in the SEQC2 somatic dataset. (E) Barplots showing recall metrics in the TVC pipeline (ION technology), computed by hap.py and RecallME (before and after the BAM file re-check step), for SNV and INDEL calls in the NA12878 dataset. (F) Distributions of TVC parameters across TPs, FPs, and b-FNs in SNV variants. Statistically significant differences were found in VAF, DP, and STB (Mann–Whitney two-sided test). (G) Distributions of TVC parameters across TPs, FPs, and b-FNs in INDEL variants. Statistically significant differences were found in VAF, DP, and STB (Mann–Whitney two-sided test). (H) Accuracies in TVC performance (NA12878) obtained by tuning optimal cutpoints for VAF, DP, and STB for SNVs, INDELs, and all calls (WHOLE).
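
The standard metrics named in panel (A) can be illustrated with a minimal Python sketch. It is not the code of recaller.R or pileup_recaller.R. The `Counts` class, the `benchmark_metrics` function, and the example numbers are hypothetical. The `recall_max` formula is an assumption: it treats b-FNs as false negatives that the BAM re-check finds supported by reads, and counts them as recoverable.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical container for benchmark counts (not part of RecallME)."""
    tp: int        # true positives
    fp: int        # false positives
    fn: int        # false negatives
    tn: int        # true negatives (high-confidence positions with no variant)
    b_fn: int = 0  # assumption: FNs found supported by reads in the BAM file


def safe_div(numerator: float, denominator: float) -> float:
    """Return NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def benchmark_metrics(c: Counts) -> dict:
    """Compute recall, precision, specificity, F1, FDR, and an assumed Recall Max."""
    if c.b_fn > c.fn:
        raise ValueError("b_fn cannot exceed fn")
    recall = safe_div(c.tp, c.tp + c.fn)
    precision = safe_div(c.tp, c.tp + c.fp)
    specificity = safe_div(c.tn, c.tn + c.fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    fdr = safe_div(c.fp, c.tp + c.fp)
    # Assumed definition: recall if every BAM-supported FN had been called.
    recall_max = safe_div(c.tp + c.b_fn, c.tp + c.fn)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "fdr": fdr,
        "recall_max": recall_max,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    example = Counts(tp=90, fp=10, fn=10, tn=890, b_fn=6)
    for name, value in benchmark_metrics(example).items():
        print(f"{name}: {value:.4f}")
```

Under this assumed definition, the gap between `recall` and `recall_max` would be the share of missed variants attributable to the calling pipeline, while the rest would point to sequencing-related issues.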
