Bioinformatics. 2014 Dec 15;30(24):3506-14.

doi: 10.1093/bioinformatics/btu538. Epub 2014 Aug 26.

LoRDEC: accurate and efficient long read error correction

Leena Salmela¹, Eric Rivals¹

Affiliation

¹ Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and LIRMM and Institut de Biologie Computationnelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France.

PMID: 25165095
PMCID: PMC4253826
DOI: 10.1093/bioinformatics/btu538

Leena Salmela et al. Bioinformatics. 2014.

Abstract

Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique that produces long reads, with comparatively lower throughput and a higher error rate. Errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping short reads onto long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space.

Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy.

Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.


Figures

**Fig. 1.**
An example of a short-read de Bruijn graph (DBG) of order k = 3. For simplicity, reverse-complement k-mers are ignored
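The kind of graph shown in Fig. 1 can be illustrated with a minimal Python sketch. It is not LoRDEC's implementation (which is written in C++ and uses a succinct representation); the function name `build_dbg`, the parameter `solid_threshold` and the toy reads are assumptions made for illustration. As in the figure, reverse-complement k-mers are ignored.

```python
from collections import Counter


def build_dbg(reads, k=3, solid_threshold=1):
    """Build a toy de Bruijn graph of order k from short reads.

    Nodes are the k-mers seen at least `solid_threshold` times.
    An edge u -> v exists when the last k-1 characters of u equal
    the first k-1 characters of v. Reverse complements are ignored.
    """
    # Count every k-mer occurring in the short reads.
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    # Keep only sufficiently frequent ("solid") k-mers as nodes.
    nodes = {kmer for kmer, count in counts.items() if count >= solid_threshold}
    # Connect each node to the nodes that extend it by one base.
    return {
        kmer: sorted(kmer[1:] + base for base in "ACGT" if kmer[1:] + base in nodes)
        for kmer in nodes
    }


# Hypothetical toy input: two overlapping short reads.
toy_graph = build_dbg(["ACGTA", "CGTAC"], k=3)
```

With these two toy reads the graph has four nodes (ACG, CGT, GTA, TAC); raising `solid_threshold` to 2 keeps only the k-mers seen in both reads.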

**Fig. 2.**
Long read correction method. (a) A long read is partitioned into weak and solid regions (respectively, lines and rectangles) according to the short read DBG. Weak regions starting or ending the long read are called the *head* or the *tail*, respectively, while other weak regions are *inner regions*. Circles in solid regions represent k-mers of the DBG. k-mers around a weak region serve as source and target nodes to search paths in the DBG. Several source/target pairs are used for each weak inner region. (b) On the second inner region, a *bridging path* between nodes s₁ and t₁ is found in the DBG to correct this region. On the third region, the path search fails to find a path between nodes s₂ and t₂. For the tail, an *extension path* is sought and found from node s₃ toward the end. Once found, the corrective sequence of the path is aligned to the tail to determine the optimal substring (thick dotted arrow)
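The two steps in this caption, partitioning a long read and bridging an inner weak region, can be sketched in Python as follows. This is a simplified illustration rather than LoRDEC's actual code: all function names are hypothetical, regions are expressed as ranges of k-mer start positions, only one source/target pair is tried (the caption says several are used), and choosing the candidate path by smallest edit distance to the original region is an assumption of this sketch.

```python
from itertools import groupby


def classify_regions(long_read, solid_kmers, k):
    """Split a long read into solid/head/inner/tail runs of k-mer start positions."""
    flags = [long_read[i:i + k] in solid_kmers for i in range(len(long_read) - k + 1)]
    regions, pos = [], 0
    for is_solid, run in groupby(flags):
        length = len(list(run))
        regions.append(["solid" if is_solid else "weak", pos, pos + length])
        pos += length
    for idx, region in enumerate(regions):
        if region[0] == "weak":
            # A weak run at the start is the head, at the end the tail.
            region[0] = "head" if idx == 0 else "tail" if idx == len(regions) - 1 else "inner"
    return [tuple(region) for region in regions]


def bridging_paths(graph, source, target, max_nodes):
    """Yield the sequence spelled by each path source -> target with at most max_nodes nodes."""
    k = len(source)
    stack = [(source, source)]
    while stack:
        node, seq = stack.pop()
        if node == target:
            yield seq
            continue
        if len(seq) - k + 1 >= max_nodes:  # depth limit; also stops cycles
            continue
        for nxt in graph.get(node, []):
            stack.append((nxt, seq + nxt[-1]))


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct_inner(read, region, graph, k, max_nodes):
    """Replace one inner weak region by the bridging path closest to it."""
    _, start, end = region
    source = read[start - 1:start - 1 + k]  # last solid k-mer before the region
    target = read[end:end + k]              # first solid k-mer after the region
    original = read[start - 1:end + k]
    candidates = list(bridging_paths(graph, source, target, max_nodes))
    if not candidates:
        return read  # path search failed; leave the region uncorrected
    best = min(candidates, key=lambda seq: edit_distance(seq, original))
    return read[:start - 1] + best + read[end + k:]


# Hypothetical toy data: the DBG of the sequence ACGTTGCA with k = 3.
K = 3
TRUE_SEQ = "ACGTTGCA"
solid = {TRUE_SEQ[i:i + K] for i in range(len(TRUE_SEQ) - K + 1)}
graph = {u: sorted(u[1:] + b for b in "ACGT" if u[1:] + b in solid) for u in solid}

noisy = "ACGTAGCA"  # one substitution relative to TRUE_SEQ
inner = [r for r in classify_regions(noisy, solid, K) if r[0] == "inner"][0]
corrected = correct_inner(noisy, inner, graph, K, max_nodes=8)
```

In this toy case the single inner region is bridged by the path CGT, GTT, TTG, TGC, GCA, which restores the read to `TRUE_SEQ`. Head and tail regions would need an extension search plus alignment, which this sketch omits.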

**Fig. 3.**
Effect of parameters on the runtime and gain of our method. We varied k, solid k-mer threshold, branching limit, maximum error rate and number of target k-mers one at a time, while other parameters were kept constant

**Fig. 4.**
Percentage of the parrot genome covered by raw and corrected reads as a function of read depth. The percentages (y-axis in log scale) are plotted for the true alignments (in black) and for alignments assumed to be uniformly distributed over the genome (in white). Raw reads are represented by squares and corrected reads by circles. The curves for corrected reads dominate those of raw reads, as correction increases the number of reads mapped. The black curves adopt similar shapes, suggesting that correction is not seriously impacted by repeats; their distances to the white curves suggest that a bias related to genomic location is already present in the raw reads


Cited by

SPAligner: alignment of long diverged molecular sequences to assembly graphs.
Dvorkina T, Antipov D, Korobeynikov A, Nurk S. BMC Bioinformatics. 2020 Jul 24;21(Suppl 12):306. doi: 10.1186/s12859-020-03590-7. PMID: 32703258.
Characterization and Analysis of the Full-Length Transcriptomes of Multiple Organs in Pseudotaxus chienii (W.C.Cheng) W.C.Cheng.
Liu L, Wang Z, Su Y, Wang T. Int J Mol Sci. 2020 Jun 17;21(12):4305. doi: 10.3390/ijms21124305. PMID: 32560294.
Draft genomic sequence of Armillaria gallica 012m: insights into its symbiotic relationship with Gastrodia elata.
Zhan M, Tian M, Wang W, Li G, Lu X, Cai G, Yang H, Du G, Huang L. Braz J Microbiol. 2020 Dec;51(4):1539-1552. doi: 10.1007/s42770-020-00317-x. Epub 2020 Jun 22. PMID: 32572836.
Sequence data for Clostridium autoethanogenum using three generations of sequencing technologies.
Utturkar SM, Klingeman DM, Bruno-Barcena JM, Chinn MS, Grunden AM, Köpke M, Brown SD. Sci Data. 2015 Apr 14;2:150014. doi: 10.1038/sdata.2015.14. eCollection 2015. PMID: 25977818.
Comprehensive profiling of epigenetic modifications in fast-growing Moso bamboo shoots.
Li T, Wang H, Zhang Y, Wang H, Zhang Z, Liu X, Zhang Z, Liu K, Yang D, Zhang H, Gu L. Plant Physiol. 2023 Feb 12;191(2):1017-1035. doi: 10.1093/plphys/kiac525. PMID: 36417282.


