Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
- PMID: 28298431
- PMCID: PMC5411767
- DOI: 10.1101/gr.215087.116
Abstract
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
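The contig NG50 reported above is a standard continuity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the estimated genome size. The following is a minimal sketch of that calculation, not code from Canu; the function name `ng50` and the toy contig lengths are assumptions made for illustration.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or None if it is undefined.

    contig_lengths: iterable of contig lengths (hypothetical toy input).
    genome_size: estimated genome size, in the same units as the lengths.
    """
    half = genome_size / 2
    covered = 0
    # Walk contigs from longest to shortest, accumulating covered length.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            # This contig is the one that pushes coverage past 50%.
            return length
    # The contigs together cover less than half of the genome.
    return None


if __name__ == "__main__":
    toy_contigs = [40, 30, 20, 10]  # illustrative values only
    print(ng50(toy_contigs, genome_size=120))
```

Unlike N50, which is normalized by the total assembly length, NG50 is normalized by the genome size, so values from assemblies of the same genome can be compared directly.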
© 2017 Koren et al.; Published by Cold Spring Harbor Laboratory Press.
Similar articles
- HINGE: long-read assembly achieves optimal repeat resolution. Genome Res. 2017 May;27(5):747-756. doi: 10.1101/gr.216465.116. Epub 2017 Mar 20. Genome Res. 2017. PMID: 28320918 Free PMC article.
- Improved assembly of noisy long reads by k-mer validation. Genome Res. 2016 Dec;26(12):1710-1720. doi: 10.1101/gr.209247.116. Epub 2016 Oct 7. Genome Res. 2016. PMID: 27831497 Free PMC article.
- Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017 May;27(5):737-746. doi: 10.1101/gr.214270.116. Epub 2017 Jan 18. Genome Res. 2017. PMID: 28100585 Free PMC article.
- One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol. 2015 Feb;23:110-20. doi: 10.1016/j.mib.2014.11.014. Epub 2014 Dec 1. Curr Opin Microbiol. 2015. PMID: 25461581 Review.
- Plant genome sequence assembly in the era of long reads: Progress, challenges and future directions. Quant Plant Biol. 2022 Mar 11;3:e5. doi: 10.1017/qpb.2021.18. eCollection 2022. Quant Plant Biol. 2022. PMID: 37077982 Free PMC article. Review.
Cited by
- Generation and Genetic Stability of a PolX and 5' MGF-Deficient African Swine Fever Virus Mutant for Vaccine Development. Vaccines (Basel). 2024 Sep 30;12(10):1125. doi: 10.3390/vaccines12101125. Vaccines (Basel). 2024. PMID: 39460292 Free PMC article.
- Cotton D genome assemblies built with long-read data unveil mechanisms of centromere evolution and stress tolerance divergence. BMC Biol. 2021 Jun 3;19(1):115. doi: 10.1186/s12915-021-01041-0. BMC Biol. 2021. PMID: 34082735 Free PMC article.
- Chromosome-level assemblies from diverse clades reveal limited structural and gene content variation in the genome of Candida glabrata. BMC Biol. 2022 Oct 8;20(1):226. doi: 10.1186/s12915-022-01412-1. BMC Biol. 2022. PMID: 36209154 Free PMC article.
- Multiple Horizontal Mini-chromosome Transfers Drive Genome Evolution of Clonal Blast Fungus Lineages. Mol Biol Evol. 2024 Aug 2;41(8):msae164. doi: 10.1093/molbev/msae164. Mol Biol Evol. 2024. PMID: 39107250 Free PMC article.
- Genome Sequence and Characterization of Acinetobacter Phage DMU1. Phage (New Rochelle). 2021 Mar 1;2(1):50-56. doi: 10.1089/phage.2020.0043. Epub 2021 Mar 17. Phage (New Rochelle). 2021. PMID: 36148435 Free PMC article.