Genome Res. 2011 Jul;21(7):1181-92.
doi: 10.1101/gr.111351.110. Epub 2011 Apr 11.

ECHO: a reference-free short-read error correction algorithm

Wei-Chun Kao et al. Genome Res. 2011 Jul.

Abstract

Developing accurate, scalable algorithms to improve data quality is an important computational challenge associated with recent advances in high-throughput sequencing technology. In this study, a novel error-correction algorithm, called ECHO, is introduced for correcting base-call errors in short reads without the need for a reference genome. Unlike most previous methods, ECHO does not require the user to specify parameters whose optimal values are typically unknown a priori. ECHO automatically sets the parameters in the assumed model and estimates error characteristics specific to each sequencing run, while maintaining a running time that is within the range of practical use. ECHO is based on a probabilistic model and is able to assign a quality score to each corrected base. Furthermore, it explicitly models heterozygosity in diploid genomes and provides a reference-free method for detecting bases that originated from heterozygous sites. On both real and simulated data, ECHO improves on the accuracy of previous error-correction methods by a factor ranging from several-fold to an order of magnitude, depending on the sequence coverage depth and the position in the read. The improvement is most pronounced toward the end of the read, where previous methods become noticeably less effective. Using a whole-genome yeast data set, it is demonstrated here that ECHO is capable of coping with nonuniform coverage. It is also shown that using ECHO to perform error correction as a preprocessing step considerably facilitates de novo assembly, particularly in the case of low-to-moderate sequence coverage depth.

Figures

Figure 1.
Illustration of the alignment of two reads by identifying a common k-mer in the overlap. The boxes labeled by K denote a common k-mer between the two reads. The symbol × denotes a sequencing error that would preclude a common k-mer from occurring in the immediate area. However, as long as the error rate is low enough and the overlap is sufficiently long, a common k-mer will exist with high probability.
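The idea in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not ECHO's implementation: the function names `kmers` and `candidate_overlaps` are hypothetical, reverse complements are ignored, and a shared k-mer is treated only as evidence for a candidate overlap at a given relative offset.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Yield (offset, k-mer) for every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield i, read[i:i + k]


def candidate_overlaps(reads, k):
    """Return {(i, j): offsets} for read pairs that share at least one k-mer.

    An offset d means read j appears to start at position d of read i
    (hypothetical convention chosen for this sketch).
    """
    # Index every k-mer by the reads and positions where it occurs.
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for pos, kmer in kmers(read, k):
            index[kmer].append((rid, pos))

    # Any two distinct reads listed under the same k-mer are candidates.
    pairs = defaultdict(set)
    for hits in index.values():
        for (r1, p1), (r2, p2) in combinations(hits, 2):
            if r1 != r2:
                pairs[(r1, r2)].add(p1 - p2)
    return dict(pairs)


# Two reads overlapping by 8 bases; the second read carries one
# substitution error at its second base (G read as T).
reads = ["ACGTTGCAAGGC", "TTCAAGGCTTAC"]
print(candidate_overlaps(reads, 5))  # {(0, 1): {4}}
print(candidate_overlaps(reads, 8))  # {}
```

With k = 5 the error destroys some shared k-mers but two error-free ones remain, so the overlap is still found. With k = 8 the overlap is too short relative to the error for any common k-mer to survive, which is the trade-off the caption describes.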
Figure 2.
Illustration of the empirical coverage distribution compared with the Poisson distribution. The empirical distribution is drawn with a dashed line and the Poisson distribution is drawn with a solid line. The total variation distance is the shaded area between the two distributions divided by two. ECHO finds the Poisson distribution minimizing the total variation distance to the empirical distribution to find the optimal set of parameters ω* and ε*.
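The fitting criterion in Figure 2 can be illustrated with a short Python sketch. It rests on a simplifying assumption: how ω and ε determine the Poisson distribution is not stated here, so the sketch searches directly over a hypothetical list of candidate Poisson means. The function names are illustrative and are not taken from ECHO.

```python
import math
from collections import Counter


def poisson_pmf(lam, n):
    """Poisson probability of observing n events given mean lam > 0."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))


def total_variation(empirical, lam):
    """Total variation distance between an empirical distribution
    (dict: count -> probability) and a Poisson with mean lam.

    This is half the summed absolute difference, i.e. the shaded area
    divided by two.
    """
    max_n = max(empirical)
    diff = sum(abs(empirical.get(n, 0.0) - poisson_pmf(lam, n))
               for n in range(max_n + 1))
    # Poisson mass beyond the largest observed count has no empirical match.
    tail = 1.0 - sum(poisson_pmf(lam, n) for n in range(max_n + 1))
    return 0.5 * (diff + max(tail, 0.0))


def best_poisson_mean(coverages, candidates):
    """Pick the candidate mean whose Poisson is closest to the observed
    coverage histogram in total variation distance."""
    total = len(coverages)
    empirical = {n: c / total for n, c in Counter(coverages).items()}
    return min(candidates, key=lambda lam: total_variation(empirical, lam))


# Synthetic coverage values shaped like a Poisson with mean 5.
coverages = [n for n in range(25)
             for _ in range(round(1000 * poisson_pmf(5.0, n)))]
print(best_poisson_mean(coverages, [float(m) for m in range(1, 11)]))  # 5.0
```

A grid search keeps the example short. Any one-dimensional minimizer could be substituted, and a real parameter search would be over whatever quantities actually determine the coverage model.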
Figure 3.
Position-specific by-base error rates for 76-bp PhiX174 data D1. (A) Error rates before and after applying the three error-correction algorithms at a sequence coverage depth of 30. Spectral alignment (SA; see Chaisson et al. 2009) and SHREC (Schröder et al. 2009) are able to improve the error rate in intermediate positions, but both become less effective at later positions. In contrast, ECHO remains effective throughout the entire read length, reducing the error rate at the end of the read from about 5% to under 1%. (B) Error rates before and after running ECHO with varying coverage depths. ECHO's ability to correct sequencing errors improves as the sequence coverage depth increases. A coverage depth of 15 seems sufficient to control the error rate throughout the entire read length.
Figure 4.
The gain of ECHO and the position-specific coverage for chromosome 1 of the yeast data D6. Each plot uses bins of 1000 bp. The top plot shows the gain of ECHO, defined as the number of corrected errors minus the number of introduced errors, divided by the number of actual errors. The bottom plot shows the position-specific coverage.
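The gain defined in the Figure 4 caption can be computed per bin as in the sketch below. The input format (lists of genomic positions for corrected, introduced, and actual errors) and the function names are assumptions made for illustration. Bins containing no actual errors are omitted, because the gain is undefined there.

```python
from collections import Counter


def gain(corrected, introduced, actual):
    """Gain = (corrected errors - introduced errors) / actual errors."""
    if actual <= 0:
        raise ValueError("gain is undefined when there are no actual errors")
    return (corrected - introduced) / actual


def binned_gain(corrected_pos, introduced_pos, actual_pos, bin_size=1000):
    """Gain per genomic bin, keyed by bin index (position // bin_size)."""
    corrected = Counter(p // bin_size for p in corrected_pos)
    introduced = Counter(p // bin_size for p in introduced_pos)
    actual = Counter(p // bin_size for p in actual_pos)
    return {b: gain(corrected[b], introduced[b], actual[b])
            for b in sorted(actual)}


# Toy positions: bin 0 has 3 actual errors, 2 corrected, 1 introduced;
# bin 1 has 2 actual errors, 1 corrected, none introduced.
print(binned_gain([10, 20, 1500], [30], [10, 20, 40, 1500, 1600]))
```

A gain of 1 means every actual error was corrected and none were introduced. The value becomes negative when a method introduces more errors than it corrects.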

