Nat Biotechnol. 2021 Apr;39(4):422-430.
doi: 10.1038/s41587-020-00747-w. Epub 2020 Dec 14.

Efficient hybrid de novo assembly of human genomes with WENGAN

Alex Di Genova et al. Nat Biotechnol. 2021 Apr.

Abstract

Generating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. Here we report an algorithm for hybrid assembly, WENGAN, that provides very high quality at low computational cost. We demonstrate de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 17.24-80.64 Mb), few assembly errors (contig NGA50: 11.8-59.59 Mb), good consensus quality (QV: 27.84-42.88) and high gene completeness (BUSCO complete: 94.6-95.2%), while consuming low computational resources (CPU hours: 187-1,200). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb).
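The consensus quality values (QV) quoted above are easier to interpret as per-base error rates. The sketch below assumes the standard Phred-scaled convention, QV = -10 log10(error rate); the abstract itself does not state how QV was computed, and the function names are illustrative only.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability.

    Assumes QV = -10 * log10(p); this convention is an assumption here,
    not something stated in the abstract.
    """
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability (0 < p <= 1) to a Phred-scaled QV."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # The QV range reported in the abstract, expressed as errors per base.
    for qv in (27.84, 42.88):
        print(f"QV {qv}: about {qv_to_error_rate(qv):.2e} errors per base")
```

Under this convention, each additional 10 QV points corresponds to a tenfold lower error rate.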

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. The WENGAN algorithm.
The WENGAN workflow consists of first assembling and error-correcting the short-read contigs (1 and 2), creating a spectrum of synthetic mate-pair libraries from long reads (3) and building the synthetic scaffolding graph (SSG) (4). The SSG is used to compute approximate long-read overlaps by building long-read-coherent paths (5 and 6). The long-read overlaps restore the long-read information and facilitate the construction and validation of the assembly backbone (7 and 8). The SSG is then used to fill the gaps by building, for each mate edge, a consensus sequence using the partial order alignment graph (9). In the final step, the SSG is used to polish the consensus sequences (10). The repeat contigs (2–10) are drawn uncollapsed to explain the WENGAN steps.
Fig. 2. WENGAN assemblies of the haploid CHM13 genome.
a, A bar plot showing the contig NG50/NGA50 of WENGAN and other state-of-the-art long-read assemblers, as well as of the current human reference genomes. NG50 is the contig length such that contigs of that length or longer together contain half (50%) of the bases of the reference genome. NGA50 is the NG50 corrected for assembly errors. NG50 and NGA50 were computed using the total contig length of GRCh38 (2.94 Gb) as the genome size. b, Assembly errors predicted by QUAST using the GRCh38 reference (autosomes plus X and Y). Assembly errors overlapping centromeric regions or segmental duplications (SDs) were excluded from the analysis. c, Consensus quality assessment by alignment of 30 unique BAC sequences to the assembled contigs using the BACVALIDATION tool. d, Gene completeness was determined using the BUSCO tool. e, SDs resolved by the genome assemblies. An SD is considered resolved if the aligned contig extends the SD flanking sequences by at least 50 kb (see Methods). Each CHM13 assembler is represented using the same color across a–e.
Fig. 3. BIONANO scaffolding of the WENGAN assemblies of CHM13.
We show the largest super-scaffold produced by merging the BIONANO map (BNG) and the WENGAN (WG) contigs generated by combining ultralong Nanopore reads (rel3) with PacBio/HiFi (20 kb) or Illumina (2 × 250 bp) reads. The names of the scaffolded WENGAN (WSC) contigs are displayed. Square brackets in a contig name indicate that the contig was corrected by the BIONANO map, and the numbers are the start–stop coordinates of the error-free contig region. The contig orientation in the super-scaffold is shown in round brackets. The white text in the alignments displays the number of matching nicking sites, the total number of nicking sites in the BNG contig and the length of the alignment in megabases. The blue bars in the BNG contigs show examples of joins guided by the WENGAN contigs.
Fig. 4. De novo genome assemblies of NA12878 when varying the long-read coverage and the short-read technology.
a, The de novo assemblies were sorted by NG50 at each long-read coverage (lollipop plot). We computed the NGA50 (the NG50 corrected for assembly errors) of each assembly using QUAST (see Methods). b, The consensus quality (see Methods) of each genome assembly and the CPU hours required for the assembly. c, The WENGAN (W-X) and FLYE assemblies of the complex MHC region located in Chr6: 28,477,797–33,448,354 (4.97 Mb). The MHC sequence was aligned to the genome assemblies, and aligned blocks ≥30 kb with a minimum identity of 95% were kept. The alignment breakpoints (vertical black lines) indicate a contig switch, an alignment error or a gap in the assembly. The assemblies of the MHC region are displayed in tracks by long-read coverage.
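The NG50 definition used in the Fig. 2 and Fig. 4 legends can be illustrated with a short calculation. This is a minimal sketch based only on that definition, not the QUAST implementation; the function name and the toy contig lengths are hypothetical.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Return the contig NG50 for a given reference genome size.

    NG50 is the length L such that contigs of length >= L together
    contain at least half of genome_size. Returns 0 when the assembly
    covers less than half of the genome size.
    """
    half = genome_size / 2
    covered = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy example (hypothetical lengths, arbitrary units).
    toy_contigs = [50, 40, 30, 20, 10]
    print(ng50(toy_contigs, genome_size=200))
```

NGA50 follows the same calculation, but applied to aligned blocks obtained after breaking contigs at assembly errors rather than to the raw contig lengths. Using a fixed genome size (2.94 Gb for GRCh38 in the legend) rather than the assembly's own total length is what makes values comparable across assemblers.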
