Bioinformatics. 2014 Dec 15;30(24):3506-14.
doi: 10.1093/bioinformatics/btu538. Epub 2014 Aug 26.

LoRDEC: accurate and efficient long read error correction

Leena Salmela et al. Bioinformatics.

Abstract

Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping short reads onto long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space.

Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. Compared with available tools, LoRDEC is at least six times faster and requires at least 93% less memory or disk space, while achieving comparable accuracy.

Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.
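The path traversal described in the Results can be illustrated with a small Python sketch. It is not LoRDEC's implementation (which is in C++ and uses a succinct graph); it assumes a plain set of short-read k-mers, a breadth-first search, and a simple cap on the length of the corrective sequence. The function names `build_kmer_set` and `find_bridging_path` and the parameter `max_len` are hypothetical.

```python
from collections import deque


def build_kmer_set(short_reads, k):
    """Collect every k-mer of the short reads (reverse complements ignored)."""
    return {r[i:i + k] for r in short_reads for i in range(len(r) - k + 1)}


def find_bridging_path(kmers, source, target, max_len):
    """Breadth-first search for a shortest path from source to target.

    Nodes are k-mers; an arc joins two k-mers that overlap by k-1 bases.
    Returns the sequence spelled by the path, or None if no path whose
    spelled sequence has at most max_len bases exists.
    This is an illustrative simplification, not the authors' search.
    """
    if source not in kmers or target not in kmers:
        return None
    k = len(source)
    parent = {source: None}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            path.reverse()
            # First k-mer in full, then the last base of each following k-mer.
            return path[0] + "".join(n[-1] for n in path[1:])
        if k + depth[node] + 1 > max_len:
            continue  # extending further would exceed the length cap
        for base in "ACGT":
            nxt = node[1:] + base
            if nxt in kmers and nxt not in parent:
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return None


kmers = build_kmer_set(["ACGTAC", "GTACGG"], k=3)
print(find_bridging_path(kmers, "ACG", "TAC", max_len=10))
```

A real corrector would also compare candidate paths against the erroneous region of the long read and limit branching; both are omitted here to keep the sketch short.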

Figures

Fig. 1.
An example of a short-read DBG of order k = 3. For simplicity, reverse complement k-mers are ignored.
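As a rough illustration of what a short-read DBG of order k = 3 looks like, the Python sketch below builds one as an adjacency dictionary. It assumes the usual node-centric definition (nodes are k-mers, with an arc from one k-mer to another when they overlap by k-1 bases) and, like the figure, ignores reverse complements. The name `build_dbg` and the `min_count` parameter (a stand-in for a solid k-mer threshold) are hypothetical, and a plain dictionary is not the succinct representation the abstract refers to.

```python
from collections import Counter


def build_dbg(reads, k, min_count=1):
    """Build a de Bruijn graph of order k from reads.

    Only k-mers seen at least min_count times are kept as nodes.
    Returns a dict mapping each k-mer to the list of its successors.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    nodes = {kmer for kmer, c in counts.items() if c >= min_count}
    graph = {}
    for kmer in nodes:
        # A successor shares the last k-1 bases of kmer as its prefix.
        graph[kmer] = [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]
    return graph


for kmer, successors in sorted(build_dbg(["ACGT", "CGTA"], k=3).items()):
    print(kmer, "->", successors)
```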
Fig. 2.
Long read correction method. (a) A long read is partitioned into weak and solid regions (respectively, lines and rectangles) according to the short read DBG. Weak regions starting or ending the long read are called the head or the tail, respectively, while other weak regions are inner regions. Circles in solid regions represent k-mers of the DBG. k-mers around a weak region serve as source and target nodes to search paths in the DBG. Several source/target pairs are used for each weak inner region. (b) On the second inner region, a bridging path between nodes s1 and t1 is found in the DBG to correct this region. On the third region, the path search fails to find a path between nodes s2 and t2. For the tail, an extension path is sought and found from node s3 toward the end. Once found, the corrective sequence of the path is aligned to the tail to determine the optimal substring (thick dotted arrow).
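The partition in panel (a) can be sketched in Python as follows. This is a simplified reading of the caption, not the authors' code: it assumes a k-mer of the long read counts as solid when it occurs in a given set of short-read k-mers, and it reports regions as half-open base intervals. The function name `partition_long_read` and the region labels are hypothetical.

```python
def partition_long_read(read, solid_kmers, k):
    """Split a long read into solid and weak regions.

    Returns a list of (kind, start, end) tuples over bases [start, end),
    where kind is "solid", "head", "inner", "tail", or "weak" (the last
    only when the read contains no solid k-mer at all).
    """
    is_solid = [read[i:i + k] in solid_kmers for i in range(len(read) - k + 1)]

    # Maximal runs of consecutive solid k-mer start positions.
    runs, i = [], 0
    while i < len(is_solid):
        if is_solid[i]:
            j = i
            while j + 1 < len(is_solid) and is_solid[j + 1]:
                j += 1
            runs.append((i, j))
            i = j + 1
        else:
            i += 1

    regions, prev_end = [], 0
    for first, last in runs:
        # Simplification: clip a run that overlaps the previous solid region.
        start, end = max(first, prev_end), last + k
        if start > prev_end:
            kind = "head" if prev_end == 0 else "inner"
            regions.append((kind, prev_end, start))
        regions.append(("solid", start, end))
        prev_end = end
    if prev_end < len(read):
        regions.append(("tail" if runs else "weak", prev_end, len(read)))
    return regions


print(partition_long_read("ACGTTTGCA", {"ACG", "CGT", "GCA"}, k=3))
```

In this sketch, the last k-mer of the solid region before an inner region and the first k-mer of the solid region after it would be natural candidates for the source and target nodes mentioned in the caption.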
Fig. 3.
Effect of parameters on the runtime and gain of our method. We varied k, the solid k-mer threshold, the branching limit, the maximum error rate and the number of target k-mers one at a time, while the other parameters were kept constant.
Fig. 4.
Percentage of the parrot genome covered by raw and corrected reads as a function of read depth. The percentages (y-axis in log scale) are plotted for the true alignments (in black) and for alignments assumed to be uniformly distributed over the genome (in white). Raw reads are represented by squares and corrected reads by circles. The curves for corrected reads dominate those for raw reads, as correction increases the number of reads mapped. The black curves adopt similar shapes, suggesting that correction is not seriously impacted by repeats; their distances to the white curves suggest that a bias related to genomic location is already present in the raw reads.
