Comparative Study

PLoS One. 2013 Sep 27;8(9):e75619.

doi: 10.1371/journal.pone.0075619. eCollection 2013.

Variant callers for next-generation sequencing data: a comparison study

Xiangtao Liu¹, Shizhong Han, Zuoheng Wang, Joel Gelernter, Bao-Zhu Yang

Affiliation

¹ Department of Psychiatry, Division of Human Genetics, Yale University School of Medicine, New Haven, Connecticut, United States of America ; VA CT Health Care Center, West Haven, Connecticut, United States of America.

PMID: 24086590
PMCID: PMC3785481
DOI: 10.1371/journal.pone.0075619

Xiangtao Liu et al. PLoS One. 2013.


Abstract

Next-generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools, GATK, glftools and Atlas2, using single-sample and multiple-sample variant-calling strategies. Using the same aligner, BWA, we built four single-sample and three multiple-sample calling pipelines and applied the pipelines to whole exome sequencing data taken from 20 individuals. We obtained genotypes generated by Illumina Infinium HumanExome v1.1 Beadchip for validation analysis and then used Sanger sequencing as a "gold-standard" method to resolve discrepancies for selected regions of high discordance. Finally, we compared the sensitivity of three of the single-sample calling pipelines using known simulated whole genome sequence data as a gold standard. Overall, for single-sample calling, the called variants were highly consistent across callers and the pairwise overlapping rate was about 0.9. Compared with other callers, GATK had the highest rediscovery rate (0.9969) and specificity (0.99996), and the Ti/Tv ratio from GATK was closest to the expected value of 3.02. Multiple-sample calling increased the sensitivity. Results from the simulated data suggested that GATK outperformed SAMtools and glfSingle in sensitivity, especially for low coverage data. Further, for the selected discrepant regions evaluated by Sanger sequencing, variant genotypes called by exome sequencing were more accurate than those from the exome array, although the average variant sensitivity and overall genotype consistency rate were as high as 95.87% and 99.82%, respectively. In conclusion, GATK showed several advantages over other variant callers for general-purpose NGS analyses. The GATK pipelines we developed perform very well.
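
The abstract uses the Ti/Tv (transition/transversion) ratio as a quality indicator for each caller's SNP set. As a rough illustration only, the ratio can be computed from a list of reference/alternate base pairs as sketched below. The function names and the toy SNP list are hypothetical and are not taken from the study's pipelines.

```python
# Minimal sketch: Ti/Tv ratio from (ref, alt) single-base substitutions.
# All names and data here are illustrative assumptions, not the authors' code.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ti_tv_ratio(snps) -> float:
    """Compute transitions / transversions over an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # identical bases are not a substitution; skip them
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return ti / tv


# Toy input: four transitions (A>G, C>T, G>A, T>C) and two transversions (A>C, G>T).
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(toy_snps))
```

A call set whose ratio sits far below the expected value for the targeted region may contain an excess of false-positive SNPs, which is why the abstract compares each caller's ratio against an expected value.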

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Unified steps of the pipelines.**
Blue rounded rectangles represent the reads, blue rectangles represent mapping-QC procedures, red callouts indicate the tools. The dashed curve arrow represents a reduced version skipping the mapping-QC steps.

**Figure 2. Boxplots of measures for validation by the pipelines.**
a. Number of SNPs. b. Ti/Tv ratio. c. Number of indels by the pipelines. d. True positive calls by the pipelines. e. False positive calls plus error genotypes by the pipelines. f. Re-discovery rate (positive prediction value) by the pipelines. g. Sensitivities of the pipelines. h. Specificities of the pipelines. The green bars indicate the first quartiles, red bars extend to medians, blue bars reach the third quartile, and error bar caps show the ranges. SAMt, glfS and glfM stand for SAMtools, glfSingle and glfMultiples respectively. “_S” and “_M” represent single and multiple calling strategies. R and F represent raw and filtered variants.
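
The validation measures named in this caption (sensitivity, specificity, and re-discovery rate as positive prediction value) follow the standard confusion-matrix definitions. The sketch below shows those standard formulas on made-up counts. How the study assigned calls to each category (for example, counting error genotypes together with false positives, as panel e suggests) is not detailed here, so the function and its inputs are assumptions for illustration.

```python
# Minimal sketch of standard validation metrics from confusion-matrix counts.
# The counts below are invented for illustration; they are not results from the study.


def validation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return sensitivity, specificity and positive predictive value (PPV)."""
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("a denominator is zero; metric undefined")
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true variants that were called
        "specificity": tn / (tn + fp),  # fraction of non-variant sites left uncalled
        "ppv": tp / (tp + fp),          # fraction of calls confirmed by validation
    }


metrics = validation_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```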

**Figure 3. Shared variants by single-sample pipelines and their validation.**
a. Average pairwise overlapping between filtered variants called by SAMtools (blue), GATK (red), glfSingle (olive green) and Atlas2 (purple). b and c. Boxplots of sensitivities and specificities for shared variants. P13 stands for shared variants between pipeline 1 (SAMtools) and 3 (GATK), P135 stands for shared variants by pipeline 1, 3 and 5 (glfSingle), and so on.
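
Pairwise overlap between call sets, as plotted in panel a, can be illustrated with simple set operations. The sketch below assumes overlap is measured as shared variants divided by the union of the two sets; the study may define its overlapping rate differently. The variant keys and caller contents are toy data.

```python
# Minimal sketch: pairwise overlap between variant call sets.
# Assumption: overlap = |A intersect B| / |A union B|; the paper's exact definition may differ.
from itertools import combinations


def pairwise_overlap(call_sets: dict) -> dict:
    """Map each unordered pair of caller names to its overlap fraction."""
    result = {}
    for (name_a, set_a), (name_b, set_b) in combinations(call_sets.items(), 2):
        union = set_a | set_b
        result[(name_a, name_b)] = len(set_a & set_b) / len(union) if union else 0.0
    return result


# Toy variants keyed by (chromosome, position, ref, alt).
toy_calls = {
    "SAMtools": {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A"), ("2", 75, "T", "C")},
    "GATK": {("1", 200, "C", "T"), ("2", 50, "G", "A"), ("2", 75, "T", "C"), ("3", 10, "A", "C")},
}
for pair, frac in pairwise_overlap(toy_calls).items():
    print(pair, round(frac, 3))
```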

**Figure 4. Positive prediction value and sensitivity of callers for WGS data simulated at different coverage settings.**
The SAMtools, GATK, glfSingle labels represent the sensitivities for SNPs, Stindel represents the sensitivity for indels called by SAMtools. a. Positive prediction value. b. Sensitivity.
