A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

doi:10.1186/s12864-019-6286-9

BMC Genomics. 2019 Dec 20;20(Suppl 11):948.

Arghya Kusum Das¹, Sayan Goswami², Kisung Lee², Seung-Jong Park²

Affiliations

¹ Department of Computer Science and Software Engineering, University of Wisconsin at Platteville, Platteville, WI, USA. dasa@uwplatt.edu.
² School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA, USA.

PMID: 31856721
PMCID: PMC6923905
DOI: 10.1186/s12864-019-6286-9

Arghya Kusum Das et al. BMC Genomics. 2019.

Abstract

Background: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads.

Methods: In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by majority voting to rectify each substituted error base.
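The widest-path idea in the Methods can be illustrated with a minimal, single-machine sketch. This is not ParLECH's distributed Hadoop/NoSQL implementation; the function names (`count_kmers`, `build_graph`, `widest_path`, `spell`), the toy reads, and the choice of anchor nodes are all assumptions made for illustration, and reverse complements are ignored. The sketch builds a de Bruijn graph from short reads, weights each edge by its k-mer coverage, and uses a Dijkstra-style search to find the path between two anchor nodes whose minimum edge coverage is as large as possible.

```python
import heapq
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the short reads (toy, in-memory)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph(kmer_counts):
    """Nodes are (k-1)-mers; each k-mer is an edge weighted by its coverage."""
    graph = {}
    for kmer, cov in kmer_counts.items():
        graph.setdefault(kmer[:-1], []).append((kmer[1:], cov))
    return graph


def widest_path(graph, source, target):
    """Return (bottleneck, node path) maximizing the minimum edge coverage."""
    best = {source: float("inf")}   # best known bottleneck per node
    parent = {source: None}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry
        if node == target:
            break
        for nxt, cov in graph.get(node, []):
            w = min(width, cov)
            if w > best.get(nxt, 0):
                best[nxt] = w
                parent[nxt] = node
                heapq.heappush(heap, (-w, nxt))
    if target not in best:
        return 0, None
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return best[target], path


def spell(path):
    """Spell the sequence represented by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy data (hypothetical): the k-mer ATT has high coverage (8) but leads
# into a branch whose later edges are covered only once.
reads = ["ATGCA"] * 5 + ["ATTCA"] + ["ATT"] * 7
graph = build_graph(count_kmers(reads, k=3))
width, path = widest_path(graph, "AT", "CA")
print(width, spell(path))  # 5 ATGCA
```

In this toy graph a greedy walk that always follows the highest-coverage outgoing edge would start down the `ATT` branch, whereas the maximum min-coverage criterion selects `ATGCA` because every edge on that path has coverage 5.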

Results: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, demonstrating its accuracy.

Conclusion: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors, whether present in the original long reads or newly introduced by the short reads.

Keywords: Hadoop; Hybrid error correction; Illumina; NoSQL; PacBio.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Widest Path Example: Select correct path for high coverage error k-mers

**Fig. 2**
Skewness in k-mer coverage statistics

**Fig. 5**
Substitution error correction

**Fig. 6**
De Bruijn graph construction and k-mer count

**Fig. 7**
Scalability of ParLECH. (a) Time to correct indel errors of the fruit fly dataset. (b) Time to correct substitution errors of the fruit fly dataset

**Fig. 8**
Comparing execution time of ParLECH with existing error correction tools. (a) Time for hybrid correction of indel errors in *E. coli* long reads (1.032 GB). (b) Time for correction of substitution errors in *E. coli* short reads (13.50 GB)

Cited by

Genome sequence assembly algorithms and misassembly identification methods.
Meng Y, Lei Y, Gao J, Liu Y, Ma E, Ding Y, Bian Y, Zu H, Dong Y, Zhu X. Mol Biol Rep. 2022 Nov;49(11):11133-11148. doi: 10.1007/s11033-022-07919-8. Epub 2022 Sep 23. PMID: 36151399. Review.
A review of the pangenome: how it affects our understanding of genomic variation, selection and breeding in domestic animals?
Gong Y, Li Y, Liu X, Ma Y, Jiang L. J Anim Sci Biotechnol. 2023 May 5;14(1):73. doi: 10.1186/s40104-023-00860-1. PMID: 37143156. Free PMC article. Review.
Sequencing DNA with nanopores: Troubles and biases.
Delahaye C, Nicolas J. PLoS One. 2021 Oct 1;16(10):e0257521. doi: 10.1371/journal.pone.0257521. eCollection 2021. PMID: 34597327. Free PMC article.
NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning.
Wang R, Chen J. BMC Genomics. 2024 Jun 7;25(1):573. doi: 10.1186/s12864-024-10446-4. PMID: 38849740. Free PMC article.
Next-generation fungal identification using target enrichment and Nanopore sequencing.
Yu PL, Fulton JC, Hudson OH, Huguet-Tapia JC, Brawner JT. BMC Genomics. 2023 Oct 2;24(1):581. doi: 10.1186/s12864-023-09691-w. PMID: 37784013. Free PMC article.
