Quake: quality-aware detection and correction of sequencing errors

David R Kelley¹, Michael C Schatz, Steven L Salzberg

Affiliations

Affiliation

¹ Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, and Department of Computer Science, University of Maryland, College Park, MD 20742, USA. dakelley@umiacs.umd.edu

PMID: 21114842
PMCID: PMC3156955
DOI: 10.1186/gb-2010-11-11-r116

Quake: quality-aware detection and correction of sequencing errors

David R Kelley et al. Genome Biol. 2010.

. 2010;11(11):R116.

doi: 10.1186/gb-2010-11-11-r116. Epub 2010 Nov 29.

Authors

David R Kelley¹, Michael C Schatz, Steven L Salzberg

Affiliation

¹ Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, and Department of Computer Science, University of Maryland, College Park, MD 20742, USA. dakelley@umiacs.umd.edu

PMID: 21114842
PMCID: PMC3156955
DOI: 10.1186/gb-2010-11-11-r116

Abstract

We introduce Quake, a program to detect and correct errors in DNA sequencing reads. Using a maximum likelihood approach incorporating quality values and nucleotide specific miscall rates, Quake achieves the highest accuracy on realistically simulated reads. We further demonstrate substantial improvements in de novo assembly and SNP detection after using Quake. Quake can be used for any size project, including more than one billion human reads, and is freely available as open source software from http://www.cbcb.umd.edu/software/quake.

PubMed Disclaimer

Figures

**Figure 1**
**Alignment difficulty**. Detecting alignments of short reads is more difficult in the presence of sequencing errors (represented as X's). (a) In the case of genome assembly, we may miss short overlaps between reads containing sequencing errors, particularly because the errors tend to occur at the ends of the reads. (b) To find variations between the sequenced genome and a reference genome, we typically first map the reads to the reference. However, reads containing variants (represented as stars) and sequencing errors will have too many mismatches and not align to their true genomic location.

**Figure 2**
**Adenine error rate**. The observed error rate and predicted error rate after nonparametric regression are plotted for adenine by quality value for a single lane of Illumina sequencing of *Megachile rotundata*. The number of training instances at each quality value are drawn as a histogram below the plot. At low and medium quality values, adenine is far more likely to be miscalled as cytosine than thymine or guanine. However, the distribution at high quality is more uniform.

**Figure 3**
**k-mer coverage**. 15-mer coverage model fit to 76× coverage of 36 bp reads from *E. coli*. Note that the expected coverage of a k-mer in the genome using reads of length L will be $\frac{L - k + 1}{L}$ times the expected coverage of a single nucleotide because the full k-mer must be covered by the read. Above, q-mer counts are binned at integers in the histogram. The error k-mer distribution rises outside the displayed region to 0.032 at coverage two and 0.691 at coverage one. The mixture parameter for the prior probability that a k-mer's coverage is from the error distribution is 0.73. The mean and variance for true k-mers are 41 and 77 suggesting that a coverage bias exists as the variance is almost twice the theoretical 41 suggested by the Poisson distribution. The likelihood ratio of error to true k-mer is one at a coverage of seven, but we may choose a smaller cutoff for some applications.

**Figure 4**
**Localize errors**. Trusted (green) and untrusted (red) 15-mers are drawn against a 36 bp read. In (a), the intersection of the untrusted k-mers localizes the sequencing error to the highlighted column. In (b), the untrusted k-mers reach the edge of the read, so we must consider the bases at the edge in addition to the intersection of the untrusted k-mers. However, in most cases, we can further localize the error by considering all bases covered by the right-most trusted k-mer to be correct and removing them from the error region as shown in (c).

**Figure 5**
**Correction search algorithm**. Pseudocode for the algorithm to search for the most likely set of corrections that makes all k-mers in the read trusted. P is a heap-based priority queue that sorts sets of corrections C by their likelihood ratio L. The algorithm examines sets of corrections in decreasing order of their likelihood until a set is found that converts all k-mers in the read to trusted k-mers.

**Figure 6**
**Correction search**. The search for the proper set of corrections that change an observed read with errors into the actual sequence from the genome can be viewed as exploring a tree. Nodes in the tree represent possible corrected reads (and implicitly sets of corrections to the observed read). Branches in the tree represent corrections. Each node can be assigned a likelihood by our model for sequencing errors as described in the text. Quake's algorithm visits the nodes in order of decreasing likelihood until a valid read is found or the threshold is passed.

See this image and copyright information in PMC

References

1. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–1145. doi: 10.1038/nbt1486. - DOI - PubMed
1. Hawkins R, Hon G, Ren B. Next-generation genomics: an integrative approach. Nat Rev Genet. 2010;11:476–486. - PMC - PubMed
1. Siva N. 1000 Genomes project. Nat Biotechnol. 2008;26:256. - PubMed
1. Haussler D, O'Brien S, Ryder O, Barker F, Clamp M, Crawford A, Hanner R, Hanotte O, Johnson W, McGuire J, Miller W, Murphy R, Murphy W, Sheldon F, Sinervo B, Venkatesh B, Wiley E, Allendorf F, Amato G, Baker C, Bauer A, Beja-Pereira A, Bermingham E, Bernardi G, Bonvicino C, Brenner S, Burke T, Cracraft J, Diekhans M, Edwards S. et al.Genome 10K: a proposal to obtain whole-genome sequence for 10 000 vertebrate species. J Hered. 2009;100:659–674. doi: 10.1093/jhered/esp086. - DOI - PMC - PubMed
1. Mardis E. Cancer genomics identifies determinants of tumor biology. Genome Biol. 2010;11:211. doi: 10.1186/gb-2010-11-5-211. - DOI - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Quake: quality-aware detection and correction of sequencing errors

Affiliation

Quake: quality-aware detection and correction of sequencing errors

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Substances

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources