BMC Bioinformatics. 2012 Sep 4:13:221.
doi: 10.1186/1471-2105-13-221.

ReQON: a Bioconductor package for recalibrating quality scores from next-generation sequencing data

Christopher R Cabanski et al. BMC Bioinformatics. 2012.

Abstract

Background: Next-generation sequencing technologies have become important tools for genome-wide studies. However, the quality scores that are assigned to each base have been shown to be inaccurate. If the quality scores are used in downstream analyses, these inaccuracies can have a significant impact on the results.

Results: Here we present ReQON, a tool that uses logistic regression to recalibrate the base quality scores from an input BAM file of aligned sequencing data. ReQON also generates diagnostic plots showing the effectiveness of the recalibration. We show that the quality scores ReQON produces are more accurate than the original quality scores, in the sense that they more closely correspond to the probability of a sequencing error, and that they discriminate better between sequencing errors and non-errors. We also compare ReQON to other available recalibration tools and show that ReQON is less biased and performs favorably in terms of quality score accuracy.
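
As a rough illustration of the idea of recalibrating quality scores with logistic regression, the sketch below fits a one-covariate model on toy data and converts the fitted error probabilities back to the Phred scale (Q = -10 log10 p). It is not ReQON's actual model or API: the function names, the single covariate (the reported quality score), the gradient-descent fit, and the toy counts are all assumptions made for this example.

```python
import math

def phred_to_prob(q):
    """Phred scale: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert an error probability back to a Phred-scaled quality score."""
    return -10 * math.log10(p)

def fit_logistic(groups, lr=0.5, epochs=5000):
    """Fit P(error) = sigmoid(b0 + b1 * q / 10) by plain gradient descent.

    groups: list of (reported_quality, n_bases, n_errors) tuples.
    """
    b0 = b1 = 0.0
    total = sum(n for _, n, _ in groups)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for q, n, errors in groups:
            x = q / 10  # rescale the covariate to keep the fit stable
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += n * p - errors          # gradient w.r.t. intercept
            g1 += (n * p - errors) * x    # gradient w.r.t. slope
        b0 -= lr * g0 / total
        b1 -= lr * g1 / total
    return b0, b1

def recalibrate(q, coef):
    """Map a reported quality score to a recalibrated one."""
    b0, b1 = coef
    p = 1 / (1 + math.exp(-(b0 + b1 * q / 10)))
    return prob_to_phred(p)

# Hypothetical toy training data: (reported quality, bases, observed errors).
# The reported scores are deliberately too optimistic.
TOY_GROUPS = [(10, 1000, 300), (20, 1000, 100), (30, 1000, 30)]

coef = fit_logistic(TOY_GROUPS)
for q, _, _ in TOY_GROUPS:
    print(q, round(recalibrate(q, coef), 1))
```

In this toy setting the reported scores overstate base quality, so the recalibrated values come out lower than the reported ones while preserving their ordering.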

Conclusion: ReQON is an open source software package, written in R and available through Bioconductor, for recalibrating base quality scores for next-generation sequencing data. ReQON produces a new BAM file with more accurate quality scores, which can improve the results of downstream analysis, and produces several diagnostic plots showing the effectiveness of the recalibration.

Figures

Figure 1
Recalibration of U87 cell line replicate 1 with ReQON. Plot A shows the distribution of sequencing errors by read position. Plot B shows frequency distributions of quality scores before (solid blue) and after (dashed red) recalibration. Reported quality scores versus empirical quality scores are shown before (plot C) and after (plot D) recalibration. The points are shaded according to the frequency of bases assigned that quality score, corresponding to the values shown in plot B. Plots C and D also report the Frequency-Weighted Squared Error (FWSE), a measure of quality score accuracy. The large decrease in FWSE confirms that the recalibrated quality scores more accurately represent the probability of a sequencing error than the original quality scores.
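
The caption does not spell out how FWSE is computed. The sketch below assumes one plausible reading: for each reported quality score, take the squared difference between the reported score and the empirical score (the Phred-scaled observed error rate), and weight it by the fraction of bases assigned that score. The function names and the toy counts are hypothetical.

```python
import math

def empirical_quality(n_bases, n_errors):
    """Phred-scaled observed error rate; assumes n_errors > 0."""
    return -10 * math.log10(n_errors / n_bases)

def fwse(counts):
    """Frequency-weighted squared error under the assumed definition.

    counts: dict mapping reported quality -> (n_bases, n_errors).
    """
    total = sum(n for n, _ in counts.values())
    return sum(
        (n / total) * (q - empirical_quality(n, errors)) ** 2
        for q, (n, errors) in counts.items()
    )

# Hypothetical counts: reported scores are too optimistic before
# recalibration and close to the empirical scores afterwards.
before = {10: (1000, 300), 20: (1000, 100), 30: (1000, 30)}
after = {5: (1000, 300), 10: (1000, 100), 15: (1000, 30)}

print(fwse(before) > fwse(after))
```

Under this reading, a perfectly calibrated set of scores gives an FWSE of zero, and larger values indicate reported scores that drift further from the observed error rates.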
Figure 2
Discrimination performance of original and ReQON-recalibrated quality scores. Relative frequency distributions of quality scores for bases not matching the reference sequence in chromosome 20 of cell line replicate 2. These non-reference bases are separated as belonging to positions in dbSNP version 132 (known variants, red curve) versus other positions (sequencing errors, blue curve). Plot A shows the distribution of original quality scores and plot B shows the distribution after recalibration with ReQON. The area under the ROC curve (AUC) is reported. The increased AUC demonstrates that the recalibrated quality scores do a better job of distinguishing sequencing errors from non-errors.
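
For readers unfamiliar with AUC as a measure of discrimination, it can be computed directly as the probability that a randomly chosen non-error base has a higher quality score than a randomly chosen sequencing error, counting ties as one half. The sketch below uses this pairwise definition; the function name and the example scores are hypothetical, and the source does not state how the reported AUC was computed.

```python
def auc(non_error_scores, error_scores):
    """Pairwise AUC: P(non-error score > error score), ties count 0.5."""
    wins = 0.0
    for good in non_error_scores:
        for bad in error_scores:
            if good > bad:
                wins += 1.0
            elif good == bad:
                wins += 0.5
    return wins / (len(non_error_scores) * len(error_scores))

# Hypothetical quality scores for non-reference bases.
known_variant_scores = [30, 35, 40]   # e.g. bases at known variant positions
other_position_scores = [5, 10, 15]   # e.g. bases treated as sequencing errors

print(auc(known_variant_scores, other_position_scores))
```

An AUC of 0.5 means the quality scores carry no information for separating the two groups, while values closer to 1 mean errors tend to receive lower scores than non-errors.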
Figure 3
Example position where bases are identified as sequencing errors by GATK but not ReQON. Plot A shows an Integrative Genomics Viewer (IGV) visualization of chr10:75,531,679-75,531,712 for cell line replicate 1, highlighting a position where the reference sequence is T but all of the bases mapped to this position are C. This position (chr10:75,531,700) is not listed as a known variant in dbSNP version 132. The bases at this position are removed from the training set by ReQON but are called as sequencing errors by GATK. Plot B shows box plots comparing the quality scores of the bases at this position after recalibration with GATK and ReQON. Overall, ReQON assigns higher quality scores to these non-reference bases than GATK.

