BMC Bioinformatics. 2015 Jul 29:16:235.
doi: 10.1186/s12859-015-0624-y.

Evaluation of variant detection software for pooled next-generation sequence data

Howard W Huang et al. BMC Bioinformatics. 2015.

Abstract

Background: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still utilize sequencing of pools containing multiple samples for the detection of sequence variants as a cost saving measure. Various software tools exist to analyze these pooled sequence data, yet little has been reported on the relative accuracy and ease of use of these different programs.

Results: In this manuscript we evaluate five variant detection programs: the Genome Analysis Toolkit (GATK), CRISP, LoFreq, VarScan, and SNVer. We assess their ability to detect variants in synthetically pooled Illumina sequencing data. To do so, we created simulated pooled binary alignment/map (BAM) files from single-sample sequencing data, using varying numbers of previously characterized samples at varying depths of coverage per sample. We report the overall runtimes and memory usage of each program, as well as each program's sensitivity and specificity in detecting known true variants.
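
To illustrate why pool size and per-sample depth both matter, the following minimal sketch computes the expected alternate-allele fraction and read count for a variant in a pool. It is not taken from the paper: it assumes diploid samples that contribute equally to the pool and ignores sequencing error, and the function names and the 20x depth are hypothetical.

```python
def expected_alt_fraction(n_samples: int, alt_copies: int = 1, ploidy: int = 2) -> float:
    """Expected fraction of reads carrying the alternate allele in a pool.

    Assumes every sample contributes equally and there is no sequencing error.
    alt_copies is the number of chromosomes in the pool carrying the variant
    (1 = a single heterozygous carrier).
    """
    total_chromosomes = n_samples * ploidy
    if n_samples < 1 or not 0 <= alt_copies <= total_chromosomes:
        raise ValueError("invalid pool size or alternate allele count")
    return alt_copies / total_chromosomes


def expected_alt_reads(n_samples: int, depth_per_sample: float,
                       alt_copies: int = 1, ploidy: int = 2) -> float:
    """Expected number of alternate-allele reads at one site in the pool."""
    pooled_depth = n_samples * depth_per_sample
    return pooled_depth * expected_alt_fraction(n_samples, alt_copies, ploidy)


# Hypothetical per-sample depth of 20x; pool sizes chosen for illustration.
for n in (4, 8, 16, 32):
    frac = expected_alt_fraction(n)
    reads = expected_alt_reads(n, depth_per_sample=20.0)
    print(f"{n:2d} samples: alt fraction {frac:.4f}, expected alt reads {reads:.1f}")
```

Under these assumptions, a single heterozygous carrier yields the same expected number of supporting reads at any pool size, but those reads make up a shrinking fraction of the pooled depth as the pool grows.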

Conclusions: GATK, CRISP, and LoFreq all gave balanced accuracy of 80% or greater for datasets with varying per-sample depth of coverage and numbers of samples per pool. VarScan and SNVer generally had balanced accuracy lower than 80%. CRISP and LoFreq required up to four times less computational time and up to ten times less physical memory than GATK did, and without filtering, gave results with the highest sensitivity. VarScan and SNVer had generally lower false positive rates, but also significantly lower sensitivity than the other three programs.

Figures

Fig. 1
Effects of Pool Size on Program Balanced Accuracies. “Balanced accuracy” is defined as the mean of the sensitivity and 1 minus the false positive rate. No data point is reported for GATK with 16 or 32 samples because runs did not complete within a reasonable timeframe. Values are plotted for (a) ClinSeq and (b) Thousand Genomes pools with read depth at 50% of that of a typical whole exome, which was 35.1x, on average, for ClinSeq samples and 21.0x, on average, for Thousand Genomes samples
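
The balanced accuracy defined in the caption above can be computed directly from confusion-matrix counts. The sketch below is illustrative only: the function name and the example counts are hypothetical and do not come from the paper's data.

```python
def balanced_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Mean of sensitivity and (1 - false positive rate).

    tp/fn: known true variants that were called / missed.
    fp/tn: known non-variant sites that were called / correctly not called.
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one true-variant site and one non-variant site")
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return (sensitivity + (1.0 - false_positive_rate)) / 2.0


# Hypothetical counts: sensitivity 0.90, false positive rate 0.05.
print(balanced_accuracy(tp=90, fn=10, fp=5, tn=95))  # about 0.925
```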
Fig. 2
Effects of Pool Coverage on Program Balanced Accuracies. “Balanced accuracy” is defined as the mean of the sensitivity and 1 minus the false positive rate. Values are plotted for various fractions of “full coverage” for (a) ClinSeq pools containing eight individuals and (b) Thousand Genomes pools containing four individuals
Fig. 3
ROC Analysis on VCFs generated from ClinSeq eight-sample, 50% coverage pools with a total of 35.1x depth of coverage, on average, with eight pools per program run. For CRISP and GATK, quality score filtering was gradually increased on a logarithmic scale (0–100, 100–1000, 1000–10,000, etc.) to obtain a full range of sensitivity and false positive scores. LoFreq’s filtering was incremented logarithmically up to 1000, then by 100s, since its quality score range was smaller than those of the other programs. Many of SNVer’s p-values were extremely small (with reported p-values as low as 0), so maximum p-value filtering was set at values from 10^-10 down to 10^-300. Full details of score thresholds used are contained in the worksheet titled “Supp Table S2 Main Paper Figure S3” in Additional file 1: Figure S3
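
The threshold sweep described in the caption above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes each call has already been reduced to a quality score and a flag saying whether it matches a known true variant, and the function name, call list, and counts are all hypothetical.

```python
from typing import List, Tuple

Call = Tuple[float, bool]  # (quality score, matches a known true variant)


def roc_points(calls: List[Call], n_true: int, n_negative: int,
               thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, false positive rate, sensitivity) for each threshold.

    A call is kept when its quality score is >= the threshold.
    n_true is the number of known true variants in the truth set;
    n_negative is the number of evaluated sites known to be non-variant.
    """
    points = []
    for threshold in thresholds:
        kept = [is_true for quality, is_true in calls if quality >= threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        points.append((threshold, fp / n_negative, tp / n_true))
    return points


# Hypothetical calls and logarithmically spaced minimum-quality thresholds.
calls = [(5.0, False), (50.0, True), (150.0, False),
         (800.0, True), (5000.0, True), (20000.0, True)]
for threshold, fpr, sens in roc_points(calls, n_true=4, n_negative=10,
                                       thresholds=[0, 100, 1000, 10000]):
    print(f"min quality {threshold:>6}: FPR {fpr:.2f}, sensitivity {sens:.2f}")
```

For a program that reports p-values rather than quality scores, the same sweep would keep calls whose p-value is at or below a maximum threshold instead.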
