ECHO: a reference-free short-read error correction algorithm

Wei-Chun Kao¹, Andrew H Chan, Yun S Song

PMID: 21482625
PMCID: PMC3129260
DOI: 10.1101/gr.111351.110

Genome Res. 2011 Jul;21(7):1181-92. doi: 10.1101/gr.111351.110. Epub 2011 Apr 11.

Affiliation

¹ Computer Science Division, University of California-Berkeley, CA 94721, USA.

Abstract

Developing accurate, scalable algorithms to improve data quality is an important computational challenge associated with recent advances in high-throughput sequencing technology. In this study, a novel error-correction algorithm, called ECHO, is introduced for correcting base-call errors in short reads, without the need for a reference genome. Unlike most previous methods, ECHO does not require the user to specify parameters whose optimal values are typically unknown a priori. ECHO automatically sets the parameters in the assumed model and estimates error characteristics specific to each sequencing run, while maintaining a running time that is within the range of practical use. ECHO is based on a probabilistic model and is able to assign a quality score to each corrected base. Furthermore, it explicitly models heterozygosity in diploid genomes and provides a reference-free method for detecting bases that originated from heterozygous sites. On both real and simulated data, ECHO is able to improve the accuracy of previous error-correction methods by severalfold to an order of magnitude, depending on the sequence coverage depth and the position in the read. The improvement is most pronounced toward the end of the read, where previous methods become noticeably less effective. Using a whole-genome yeast data set, it is demonstrated here that ECHO is capable of coping with nonuniform coverage. Also, it is shown that using ECHO to perform error correction as a preprocessing step considerably facilitates de novo assembly, particularly in the case of low-to-moderate sequence coverage depth.

Figures

**Figure 1.**
Illustration of the alignment of two reads by identifying a common k-mer in the overlap. The boxes labeled by K denote a common k-mer between the two reads. The symbol × denotes a sequencing error that would preclude a common k-mer from occurring in the immediate area. However, as long as the error rate is low enough and the overlap is sufficiently long, a common k-mer will exist with high probability.
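
The idea in this caption can be sketched in a few lines of Python. This is only an illustration of finding candidate overlaps through shared k-mers, not ECHO's implementation: the function name `candidate_overlaps`, the returned tuple layout, and the choice to ignore reverse complements are all assumptions made here for brevity.

```python
from collections import defaultdict

def candidate_overlaps(reads, k):
    """Return a set of (i, j, offset) for read pairs sharing a k-mer.

    offset is the k-mer's position in read i minus its position in
    read j, i.e. the shift that lines the two reads up.
    Hypothetical helper; reverse complements are not considered.
    """
    # Index every k-mer by the reads and positions where it occurs.
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))

    # Any two distinct reads listed under the same k-mer may overlap.
    candidates = set()
    for hits in index.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                (i, pos_i), (j, pos_j) = hits[a], hits[b]
                if i != j:
                    candidates.add((i, j, pos_i - pos_j))
    return candidates

reads = ["ACGTACGGT", "TACGGTCCA", "GGGGGGGGG"]
print(candidate_overlaps(reads, 4))
```

Here the first two reads share the k-mers TACG, ACGG, and CGGT, all at the same relative shift, so they collapse into a single candidate. A sequencing error inside one of those k-mers would remove it from the index, but the remaining shared k-mers would still report the pair, which is the point the figure makes.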

**Figure 2.**
Illustration of the empirical coverage distribution compared with the Poisson distribution. The empirical distribution is drawn with a dashed line and the Poisson distribution is drawn with a solid line. The total variation distance is the shaded area between the two distributions divided by two. ECHO determines the optimal parameters ω* and ε* by finding the Poisson distribution that minimizes the total variation distance to the empirical distribution.
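
The total variation distance described in this caption can be computed directly from a list of observed coverage values. The sketch below is an assumption-laden illustration, not ECHO's code: `total_variation`, `best_poisson_mean`, and the grid of candidate means are hypothetical, and the outer search over ω and ε that the caption refers to is left out.

```python
import math
from collections import Counter

def poisson_pmf(lam, n):
    # Computed in log space to stay stable for larger n.
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def total_variation(coverages, lam):
    """Half the summed absolute difference between the empirical
    coverage distribution and a Poisson distribution with mean lam."""
    counts = Counter(coverages)
    total = len(coverages)
    dist = 0.0
    poisson_mass = 0.0
    for n in range(max(counts) + 1):
        p = poisson_pmf(lam, n)
        poisson_mass += p
        dist += abs(counts.get(n, 0) / total - p)
    # Beyond the largest observed value the empirical mass is zero,
    # so the remaining Poisson tail contributes in full.
    dist += max(0.0, 1.0 - poisson_mass)
    return dist / 2

def best_poisson_mean(coverages, candidate_means):
    # Simple grid search; the candidate grid is an assumption.
    return min(candidate_means, key=lambda lam: total_variation(coverages, lam))

coverages = [3, 4, 5, 5, 6, 7, 5, 4, 6, 5]
print(best_poisson_mean(coverages, [1.0, 5.0, 20.0]))
```

Dividing by two keeps the distance between 0 (identical distributions) and 1 (no shared mass), which matches the "shaded area divided by two" description.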

**Figure 3.**
Position-specific by-base error rates for 76-bp PhiX174 data D1. (A) Error rates before and after applying the three error-correction algorithms for sequence coverage depth 30. Spectral alignment (or SA) (see Chaisson et al. 2009) and SHREC (Schröder et al. 2009) are able to improve the error rate in intermediate positions, but they both become less effective for later positions. In contrast, ECHO remains effective throughout the entire read length, reducing the error rate at the end of the read from about 5% to under 1%. (B) Error rates before and after running ECHO with varying coverage depths. ECHO's ability to correct sequencing errors improves as the sequence coverage depth increases. A coverage depth of 15 seems sufficient to control the error rate throughout the entire read length.

**Figure 4.**
The gain of ECHO and the position-specific coverage for chromosome 1 of the yeast data D6. Each plot uses bins of 1000 bp. The *top* plot shows the gain of ECHO, defined as the number of corrected errors minus the number of introduced errors, divided by the number of actual errors. The *bottom* plot shows the position-specific coverage.
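
As a worked instance of the gain definition in this caption, the following minimal sketch computes it from three counts. The function name, argument names, and example numbers are illustrative assumptions and are not taken from the paper's data.

```python
def gain(corrected, introduced, actual):
    """Gain as defined in the caption: (corrected errors minus
    introduced errors) divided by the number of actual errors."""
    if actual <= 0:
        raise ValueError("gain is undefined when there are no actual errors")
    return (corrected - introduced) / actual

# Hypothetical bin: 100 true errors, 80 fixed, 5 new errors introduced.
print(gain(corrected=80, introduced=5, actual=100))
```

A gain of 1 would mean every error was corrected and none were introduced, while a negative gain means the correction step introduced more errors than it removed.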
