Comparative Study
PLoS One. 2013 Sep 27;8(9):e75619.
doi: 10.1371/journal.pone.0075619. eCollection 2013.

Variant callers for next-generation sequencing data: a comparison study

Xiangtao Liu et al. PLoS One. 2013.

Abstract

Next-generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools, GATK, glftools and Atlas2, using single-sample and multiple-sample variant-calling strategies. Using the same aligner, BWA, we built four single-sample and three multiple-sample calling pipelines and applied them to whole exome sequencing data from 20 individuals. For validation, we obtained genotypes generated by the Illumina Infinium HumanExome v1.1 BeadChip and then used Sanger sequencing as a "gold-standard" method to resolve discrepancies in selected regions of high discordance. Finally, we compared the sensitivity of three of the single-sample calling pipelines using simulated whole genome sequence data, for which the true variants are known, as a gold standard. Overall, for single-sample calling, the called variants were highly consistent across callers, with a pairwise overlapping rate of about 0.9. Compared with the other callers, GATK had the highest rediscovery rate (0.9969) and specificity (0.99996), and the Ti/Tv ratio of the GATK calls was closest to the expected value of 3.02. Multiple-sample calling increased the sensitivity. Results from the simulated data suggested that GATK outperformed SAMtools and glfSingle in sensitivity, especially for low coverage data. Further, for the selected discrepant regions evaluated by Sanger sequencing, variant genotypes called by exome sequencing were more accurate than those from the exome array, although the average variant sensitivity and overall genotype consistency rate were as high as 95.87% and 99.82%, respectively. In conclusion, GATK showed several advantages over other variant callers for general purpose NGS analyses. The GATK pipelines we developed perform very well.
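
The validation measures named in the abstract (rediscovery rate, sensitivity, specificity) can be illustrated with a small sketch. This is not the authors' code: the function name, the set-based site representation, and the treatment of array genotypes as truth are all assumptions made for illustration, and the sketch ignores genotype errors at correctly discovered sites.

```python
def validation_metrics(called, truth_variant, truth_reference):
    """Compare called variant sites against array genotypes.

    All three arguments are hypothetical inputs, assumed to be sets of
    site identifiers such as (chrom, pos) tuples:
      called          - sites the pipeline reports as variant
      truth_variant   - array sites genotyped as carrying a variant
      truth_reference - array sites genotyped as homozygous reference
    """
    tp = len(called & truth_variant)      # variants found by both
    fp = len(called & truth_reference)    # called, but array says reference
    fn = len(truth_variant - called)      # array variants the caller missed
    tn = len(truth_reference - called)    # reference sites left uncalled

    def ratio(num, den):
        # Return NaN rather than raising when a class is empty.
        return num / den if den else float("nan")

    return {
        "rediscovery_rate": ratio(tp, tp + fp),  # positive predictive value
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    called = {1, 2, 3, 4}
    truth_variant = {1, 2, 3, 5}
    truth_reference = {4, 6, 7, 8, 9}
    print(validation_metrics(called, truth_variant, truth_reference))
```

Only sites present on the array contribute to these counts, so calls outside the array content are neither rewarded nor penalized in this sketch.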

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Unified steps of the pipelines.
Blue rounded rectangles represent the reads, blue rectangles represent mapping-QC procedures, and red callouts indicate the tools. The dashed curved arrow represents a reduced version that skips the mapping-QC steps.
Figure 2. Boxplots of validation measures by pipeline.
a. Number of SNPs. b. Ti/Tv ratio. c. Number of indels by the pipelines. d. True positive calls by the pipelines. e. False positive calls plus error genotypes by the pipelines. f. Rediscovery rate (positive predictive value) by the pipelines. g. Sensitivities of the pipelines. h. Specificities of the pipelines. The green bars indicate the first quartiles, red bars extend to the medians, blue bars reach the third quartiles, and error bar caps show the ranges. SAMt, glfS and glfM stand for SAMtools, glfSingle and glfMultiples, respectively. “_S” and “_M” represent the single-sample and multiple-sample calling strategies. R and F represent raw and filtered variants.
Figure 3. Shared variants by single-sample pipelines and their validation.
a. Average pairwise overlapping between filtered variants called by SAMtools (blue), GATK (red), glfSingle (olive green) and Atlas2 (purple). b and c. Boxplots of sensitivities and specificities for shared variants. P13 stands for shared variants between pipeline 1 (SAMtools) and 3 (GATK), P135 stands for shared variants by pipeline 1, 3 and 5 (glfSingle), and so on.
Figure 4. Positive predictive value and sensitivity of callers for WGS data simulated at different coverage settings.
The SAMtools, GATK and glfSingle labels represent the sensitivities for SNPs; Stindel represents the sensitivity for indels called by SAMtools. a. Positive predictive value. b. Sensitivity.
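
Two of the summaries plotted in the figures, the Ti/Tv ratio and the average pairwise overlap between callers, can be sketched as follows. The function names and input shapes are hypothetical. The overlap definition used here (shared variants divided by the union of the two call sets) is an assumption, since this record does not state how the overlapping rate was computed.

```python
from itertools import combinations

# Transitions swap purines (A<->G) or pyrimidines (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ti_tv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base pairs."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, ambiguous bases and non-variants
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


def mean_pairwise_overlap(call_sets):
    """Average overlap over all caller pairs.

    call_sets maps a caller name to a set of variant keys, for example
    (chrom, pos, ref, alt) tuples. Overlap is assumed to be
    |A & B| / |A | B| for each pair of callers.
    """
    rates = []
    for a, b in combinations(call_sets.values(), 2):
        union = a | b
        if union:
            rates.append(len(a & b) / len(union))
    return sum(rates) / len(rates) if rates else float("nan")


if __name__ == "__main__":
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]))
    print(mean_pairwise_overlap({"x": {1, 2, 3}, "y": {2, 3, 4}}))
```

A different denominator, such as the size of the smaller call set, would give higher overlap values, so results from this sketch are not directly comparable to the reported rate of about 0.9.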


Publication types