BMC Bioinformatics. 2015 Jul 29:16:235.

doi: 10.1186/s12859-015-0624-y.

Evaluation of variant detection software for pooled next-generation sequence data

Howard W Huang¹; NISC Comparative Sequencing Program; James C Mullikin², Nancy F Hansen³

Affiliations

¹ National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. hhuang58@jhu.edu.
² National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. mullikin@mail.nih.gov.
³ National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. nhansen@mail.nih.gov.

PMID: 26220471
PMCID: PMC4518579
DOI: 10.1186/s12859-015-0624-y

Howard W Huang et al. BMC Bioinformatics. 2015.

Abstract

Background: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still utilize sequencing of pools containing multiple samples for the detection of sequence variants as a cost saving measure. Various software tools exist to analyze these pooled sequence data, yet little has been reported on the relative accuracy and ease of use of these different programs.

Results: In this manuscript we evaluate five variant detection programs: The Genome Analysis Toolkit (GATK), CRISP, LoFreq, VarScan, and SNVer. We assess their ability to detect variants in synthetically pooled Illumina sequencing data by creating simulated pooled binary alignment/map (BAM) files from single-sample sequencing data, using varying numbers of previously characterized samples at varying depths of coverage per sample. We report the overall runtimes and memory usage of each program, as well as each program's sensitivity and specificity in detecting known true variants.

Conclusions: GATK, CRISP, and LoFreq all gave balanced accuracy of 80% or greater for datasets with varying per-sample depth of coverage and numbers of samples per pool. VarScan and SNVer generally had balanced accuracy lower than 80%. CRISP and LoFreq required up to four times less computational time and up to ten times less physical memory than GATK did, and without filtering, gave results with the highest sensitivity. VarScan and SNVer had generally lower false positive rates, but also significantly lower sensitivity than the other three programs.
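The figure captions define balanced accuracy as the mean of the sensitivity and 1 minus the false positive rate. A minimal sketch of that calculation follows, assuming variant calls have already been compared against the known true variants and tallied into a confusion matrix. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def balanced_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Mean of sensitivity and (1 - false positive rate).

    tp/fn: known true variants that were / were not called.
    fp/tn: non-variant sites that were / were not called as variant.
    """
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return (sensitivity + (1.0 - false_positive_rate)) / 2.0


# Hypothetical counts, for illustration only.
example = balanced_accuracy(tp=3, fn=1, fp=1, tn=3)
print(f"balanced accuracy: {example:.3f}")
```

Under this definition a program that calls very few variants can score a low false positive rate and still have a low balanced accuracy, because its sensitivity drags the mean down.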

Figures

**Fig. 1**
Effects of Pool Size on Program Balanced Accuracies. “Balanced accuracy” is defined as the mean of the sensitivity and 1 minus the false positive rate. No data point is reported for GATK with 16 or 32 samples because runs did not complete within a reasonable timeframe. Values are plotted for (a) ClinSeq and (b) Thousand Genomes pools containing read depth 50% of a typical whole exome, which was 35.1x, on average, for ClinSeq samples and 21.0x, on average, for Thousand Genomes samples.

**Fig. 2**
Effects of Pool Coverage on Program Balanced Accuracies. “Balanced accuracy” is defined as the mean of the sensitivity and 1 minus the false positive rate. Values are plotted for various fractions of “full coverage” for (a) ClinSeq pools containing eight individuals and (b) Thousand Genomes pools containing four individuals

**Fig. 3**
ROC Analysis on VCFs generated from ClinSeq eight-sample, 50% coverage pools with a total of 35.1x depth of coverage, on average, with eight pools per program run. For CRISP and GATK, quality score filtering was gradually increased on a logarithmic scale (0–100, 100–1000, 1000–10,000, etc.) to obtain a full range of sensitivity and false positive scores. LoFreq’s filtering was incremented logarithmically up to 1000, then by 100s, since its quality score range was smaller than those of the other programs. Many of SNVer’s p-values were extremely small (with reported p-values as low as 0), so maximum p-value filtering was set at values from 10⁻¹⁰ down to 10⁻³⁰⁰. Full details of score thresholds used are contained in the worksheet titled “Supp Table S2 Main Paper Figure S3” in Additional file 1: Figure S3.
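The threshold sweep described in the Fig. 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes each call has already been reduced to a quality score and a flag saying whether it matches a known true variant, and that the totals of true variants and non-variant sites are known. The function name, the input layout, and the thresholds are all hypothetical.

```python
from typing import List, Tuple


def roc_points(
    calls: List[Tuple[float, bool]],
    n_true: int,
    n_negative: int,
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, false positive rate) per threshold.

    calls: (quality score, is_known_true_variant) for every reported call.
    n_true: total known true variants, including any that were never called.
    n_negative: total non-variant sites evaluated.
    """
    points = []
    for threshold in thresholds:
        # Keep only calls at or above the minimum quality score.
        kept = [is_true for quality, is_true in calls if quality >= threshold]
        true_positives = sum(kept)
        false_positives = len(kept) - true_positives
        points.append(
            (threshold, true_positives / n_true, false_positives / n_negative)
        )
    return points


# Hypothetical calls and logarithmically spaced thresholds.
calls = [(5.0, False), (50.0, True), (500.0, True), (5000.0, True)]
for point in roc_points(calls, n_true=4, n_negative=10,
                        thresholds=[0, 10, 100, 1000]):
    print(point)
```

For a program filtered on a maximum p-value rather than a minimum quality score, as the caption describes for SNVer, the comparison would be reversed so that calls with p-values at or below the threshold are kept.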

Grants and funding

Intramural NIH HHS/United States
