Genome Biol. 2024 Apr 26;25(1):107.
doi: 10.1186/s13059-024-03252-4.

NextDenovo: an efficient error correction and accurate assembly tool for noisy long reads


Jiang Hu et al. Genome Biol. 2024.

Abstract

Long-read sequencing data, particularly those derived from the Oxford Nanopore sequencing platform, tend to exhibit high error rates. Here, we present NextDenovo, an efficient error correction and assembly tool for noisy long reads, which achieves a high level of accuracy in genome assembly. We apply NextDenovo to assemble 35 diverse human genomes from around the world using Nanopore long-read data. These genomes allow us to identify the landscape of segmental duplication and gene copy number variation in modern human populations. The use of NextDenovo should pave the way for population-scale long-read assembly using Nanopore long-read data.

Keywords: Error-correction; Genome assembly; Human genomes; Long reads; Segmental duplication.

Conflict of interest statement

De-Peng Wang is the chief executive officer of GrandOmics Biosciences Company. Jiang Hu, Zhuo Wang, Zongyi Sun, Fan Liang, and Jingjin Li are employees of GrandOmics Biosciences Company. The remaining authors have no conflicts of interest to declare.

Figures

Fig. 1
NextDenovo pipeline. (A) Overlapping reads. (B) Alignments erroneously caused by repeats were filtered out and chimeric reads were split. (C) A confidence score was calculated for a given allele at each position with a fixed 3-mer, and the allele with the maximum score was selected as the correct base. The colored rectangles represent the different bases. (D) NextDenovo first identifies all LSRs in the raw reads, extracts each subsequence spanning these LSRs, and assigns a k-mer score to each subsequence. It then filters out the subsequences with lower scores and produces a pseudo-LSR seed using a greedy POA consensus algorithm. All pseudo-LSR seeds from the same seed are linked as the reference, and all subsequences are mapped to this reference while the KSC algorithm is reapplied to produce a corrected pseudo seed. The corrected LSRs are then inserted into the corresponding positions in the raw reads to generate the final corrected reads. (E) NextDenovo calculates dovetail alignments by two rounds of overlapping, constructs an assembly graph, removes transitive edges, tips, bubbles, and edges with low scores, and generates contigs. Finally, NextDenovo maps all seeds to contigs and breaks a contig if it possesses low-quality regions.
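
The graph-cleaning step in panel E mentions removing transitive edges. As a rough illustration of that general idea only (not NextDenovo's implementation), the sketch below drops an edge a→c whenever some read b gives a path a→b→c. The function name, the dictionary representation, and the toy read names are assumptions made for this example; it also assumes an acyclic graph and ignores overlap lengths and scores, which a real assembler would use.

```python
def remove_transitive_edges(graph):
    """Return a copy of `graph` without transitive edges.

    `graph` maps a read id to the set of read ids it overlaps into
    (hypothetical toy representation, assumed acyclic).
    """
    reduced = {node: set(successors) for node, successors in graph.items()}
    for a, successors in graph.items():
        for b in successors:
            # Any c reachable as a -> b -> c makes a direct a -> c redundant.
            for c in graph.get(b, ()):
                if c != b and c in successors:
                    reduced[a].discard(c)
    return reduced


# Toy overlap graph: r1 overlaps r2 and r3, and r2 overlaps r3.
overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
print(remove_transitive_edges(overlaps))
```

In the toy graph the edge r1→r3 is implied by r1→r2→r3, so only the chain r1→r2→r3 remains.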
Fig. 2
De novo assembly of 35 human genomes. (A) Geographical location of the 35 individuals sequenced. (B) Comparison of 35 human assemblies between NextDenovo and Flye. NG50 is the length N such that 50% of the reference genome is covered by contigs with length ≥ N. LG50 is the number of contigs with length ≥ NG50. NGA50 is the NG50 of the aligned blocks that are obtained by breaking contigs at misassembly events and removing all unaligned bases. LGA50 is the number of aligned blocks with length ≥ NGA50. Misassemblies and QV were evaluated by QUAST, where QV is defined as $\mathrm{QV} = -10 \times \log_{10}\left(\frac{\#\,\text{mismatches per 100 kbp} + \#\,\text{indels per 100 kbp}}{100\ \text{kbp}}\right)$. Gene completeness and “multicopy genes retained” are reported by asmgene; “multicopy genes retained” is the percentage of multicopy genes in the reference genome that remain multicopy in the assembly. QV, gene completeness, and “multicopy genes retained” were evaluated using the polished assemblies, and the other metrics were evaluated using the raw assemblies. The metrics represented by the red points are larger than the metrics represented by the blue points.
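
The definitions of NG50, LG50, and QV in this caption can be made concrete with a short calculation. The following is a minimal sketch that follows the caption's wording; the function names and the toy numbers are assumptions for illustration, and this is not the QUAST code used by the authors. NGA50 and LGA50 are omitted because they need alignments to a reference.

```python
import math


def ng50_lg50(contig_lengths, reference_length):
    """Return (NG50, LG50) following the caption's definitions."""
    covered = 0
    ng50 = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= reference_length / 2:
            ng50 = length
            break
    if ng50 == 0:
        return 0, 0  # the contigs cover less than half of the reference
    # LG50: number of contigs with length >= NG50.
    lg50 = sum(1 for length in contig_lengths if length >= ng50)
    return ng50, lg50


def qv(mismatches_per_100kbp, indels_per_100kbp):
    """QV = -10 * log10((mismatches + indels per 100 kbp) / 100 kbp)."""
    errors = mismatches_per_100kbp + indels_per_100kbp
    if errors == 0:
        return math.inf  # no observed errors
    return -10 * math.log10(errors / 100_000)


# Toy example: five contigs against a 200-unit reference.
print(ng50_lg50([50, 40, 30, 20, 10], 200))  # (30, 3)
print(qv(6, 4))
```

With these toy inputs, the three longest contigs (50 + 40 + 30 = 120) are the first to cover half of the 200-unit reference, so NG50 is 30 and LG50 is 3. Ten errors per 100 kbp correspond to an error rate of 10^-4, which gives a QV of about 40.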
Fig. 3
Distribution of duplicate genes and SD hotspots. (A) Gene symbols within duplications (gene names are marked by numbers and are shown in the subfigures). (B) Bar plots of SD hotspots in African/non-African genomes. (C) Coverage plot of 35 human genome assemblies. (D) Colored map of peri/centromeric satellite DNA (αSat: alpha satellite DNA, βSat: beta satellite DNA, HSat: human satellite DNA; see [10] for more detailed definitions). The ideogram plot was built from the T2T-CHM13 (v2) genome. Annotations of peri/centromeric and cytoband regions were downloaded from UCSC (https://hgdownload.soe.ucsc.edu/gbdb/hs1/).

References

    1. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133–138. doi: 10.1126/science.1162986.
    2. Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, et al. The potential and challenges of nanopore sequencing. Nat Biotechnol. 2008;26:1146–1153. doi: 10.1038/nbt.1495.
    3. Wenger AM, Peluso P, Rowell WJ, Chang PC, Hunkapiller MW. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37:1155–1162. doi: 10.1038/s41587-019-0217-9.
    4. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18:170–175. doi: 10.1038/s41592-020-01056-5.
    5. Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 2020;30:1291–1305. doi: 10.1101/gr.263566.120.
