Genome Res. 2025 Jul 1;35(7):1583-1594.
doi: 10.1101/gr.280383.124.

Verkko2 integrates proximity-ligation data with long-read De Bruijn graphs for efficient telomere-to-telomere genome assembly, phasing, and scaffolding

Dmitry Antipov et al. Genome Res.

Abstract

The Telomere-to-Telomere Consortium recently finished the first truly complete sequence of a human genome. To resolve the most complex repeats, this project relied on the semimanual combination of long, accurate Pacific Biosciences (PacBio) HiFi and ultralong Oxford Nanopore Technologies sequencing reads. The Verkko assembler later automated this process, achieving complete assemblies for approximately half of the chromosomes in a diploid human genome. However, the first version of Verkko was computationally expensive and could not resolve all regions of a typical human genome. Here we present Verkko2, which implements a more efficient read correction algorithm, improves repeat resolution and gap closing, introduces proximity-ligation-based haplotype phasing and scaffolding, and adds support for multiple long-read data types. These enhancements allow Verkko2 to assemble all regions of a diploid human genome, including the short arms of the acrocentric chromosomes and both sex chromosomes. Together, these changes increase the number of telomere-to-telomere scaffolds by twofold, reduce runtime by fourfold, and improve assembly correctness. On a panel of 19 human genomes, Verkko2 assembles an average of 39 of 46 complete chromosomes as scaffolds, with 21 of these assembled as gapless contigs. Together, these improvements enable telomere-to-telomere comparative genomics and pangenomics, at scale.

Figures

Figure 1.
An assembly graph tangle resulting from the rDNA arrays of the HG002 human genome. Bandage (Wick et al. 2015) visualization of the assembly graph for the 10 HG002 acrocentric chromosomes (diploid Chromosomes 13, 14, 15, 21, 22). Maternal and paternal haplotype–assigned nodes are shown in red and blue, respectively; the rDNA repeats, in light green; the distal satellite regions, in orange; and the telomeres, in dark green. Each distal satellite is labeled according to the HG002 v1.0.1 reference assembly, with the exception of Chromosome 13 and Chromosome 22 paternal, which are too similar to be separated (Potapova et al. 2024). The inset shows a linear schematic of a human acrocentric chromosome with colors matching the assembly graph. The paternal and maternal haplotype–assigned sequences comprise the entire q-arm and the proximal component of the p-arm. In contrast to other regions of the genome, the short arms of the acrocentric chromosomes do not split into a typical diploid arrangement and are difficult to phase correctly. Such complex structures violate common assumptions made by diploid phasing and scaffolding tools.
Figure 2.
Overview of the Verkko2 Hi-C/Pore-C processing. The process starts with the ULA graph built from the LA and UL sequences. Because the ULA graph links are used only to cluster the graph into connected components, they are shown as dotted gray lines in the figure. The BWA-MEM/minimap2 step aligns the Hi-C or Pore-C data to the sequences of the ULA graph nodes, and the counts of links connecting each pair of nodes are tallied. Next, the Match Graph step ignores homozygous (based on coverage) and short (by default ≤200 kb) nodes. The remaining nodes are self-aligned to identify homology. The initially computed Hi-C edges are filtered using the Match Graph to build the Hi-C Graph. In many cases, the highest-count Hi-C connection is between homologous pairs of nodes. To avoid these false links, edges connecting potentially repetitive nodes are removed (shown with a value of zero), whereas edges connecting homologous nodes are given large negative weights (shown with −). These updated link weights are used to bipartition each connected component of the graph into two haplotypes. These partitions are then provided to Rukki along with the ULA graph to generate haplotype paths, and the pipeline proceeds as in Verkko1. These haplotype-consistent paths are again used to identify homology, shown with blue edges based on the Match Graph step. To connect the two blue paths, all four possible connections between them are considered based on Hi-C link evidence. In this example, the alignment of both blue paths to the red path adds a multiplicative bonus to the one Hi-C connection consistent with the alignment, leading to a scaffold connecting the blue paths.
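The edge reweighting and bipartitioning described in this caption can be illustrated with a minimal Python sketch. This is not Verkko2's implementation. The function names, the penalty value, and the greedy local-search strategy are all assumptions chosen for illustration; the caption specifies only that edges to repetitive nodes are removed, that homologous edges receive large negative weights, and that each component is split into two haplotypes.

```python
# Hypothetical penalty; the caption only says "large negative weights".
HOMOLOGY_PENALTY = -1000


def build_weights(hic_counts, homologous_pairs, repetitive_nodes):
    """Turn raw Hi-C link counts into signed edge weights.

    hic_counts: dict mapping (node_u, node_v) -> number of Hi-C links.
    homologous_pairs: iterable of (node_u, node_v) found homologous.
    repetitive_nodes: set of nodes whose edges are dropped entirely.
    """
    weights = {}
    for (u, v), count in hic_counts.items():
        if u in repetitive_nodes or v in repetitive_nodes:
            continue  # equivalent to setting the weight to zero
        weights[frozenset((u, v))] = count
    for u, v in homologous_pairs:
        # Homologs must end up in different haplotypes, so overwrite
        # any (often misleadingly high) Hi-C count with a penalty.
        weights[frozenset((u, v))] = HOMOLOGY_PENALTY
    return weights


def bipartition(nodes, weights, max_rounds=100):
    """Greedy local search (an assumed strategy, not the published one).

    Objective: sum of weights within a side minus sum of weights
    across sides. Flip the single best node until no flip helps.
    """
    side = {n: 0 for n in nodes}

    def gain(n):
        total = 0
        for m in nodes:
            if m == n:
                continue
            w = weights.get(frozenset((n, m)), 0)
            # Flipping n turns "same side" edges into "across" edges
            # and vice versa, changing the objective by -/+ 2w.
            total += -2 * w if side[n] == side[m] else 2 * w
        return total

    for _ in range(max_rounds):
        best = max(nodes, key=gain)
        if gain(best) <= 0:
            break
        side[best] ^= 1
    return side
```

In a toy component where two homologous node pairs share the highest raw Hi-C count, the penalty keeps each pair on opposite sides, and the remaining positive weights group nodes from the same haplotype together.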
Figure 3.
Comparison of tested assemblers with all statistics measured as before. Verkko Hi-C has the highest T2T scaffold rate (except on chicken), followed by Verkko2 trio. Both versions of Hifiasm are comparable to Verkko1. With the exception of Hifiasm Hi-C on chicken and HG00733, all assemblers have comparable Hamming error rates. Verkko1 has a consistently higher rate of missing genes compared with other assemblies. All assemblers were run on the NIH Biowulf compute cluster.
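For readers unfamiliar with the Hamming error rate, the following Python sketch shows one common definition: the fraction of haplotype-informative markers that disagree with the majority haplotype of their contig. It is an illustration only. The string-of-labels input is a hypothetical simplification, and the caption does not state the exact procedure used for this figure.

```python
def hamming_error_rate(contigs):
    """Fraction of markers on the minority haplotype of their contig.

    contigs: iterable of strings, one per contig, where each character
    is 'P' (paternal marker) or 'M' (maternal marker). This encoding
    is an assumption made for illustration.
    """
    wrong = 0
    total = 0
    for markers in contigs:
        paternal = markers.count("P")
        maternal = markers.count("M")
        # Markers on the minority haplotype count as phasing errors.
        wrong += min(paternal, maternal)
        total += paternal + maternal
    return wrong / total if total else 0.0
```

A perfectly phased contig contributes no errors. A contig that switches haplotype halfway contributes close to half of its markers, which is why this metric penalizes long-range phasing mistakes more heavily than a count of switch points does.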
Figure 4.
Verkko2 results on the HPRC year 1 assemblies. The T2T statistics are computed as before. QV and phasing error are computed using yak, and the average of both haplotypes is reported. The core genes are computed using compleasm and reported for each haplotype. The dup categories report single-copy genes present more than once in a haplotype. Because of natural variation, a small number (<1%) of duplicated genes is expected. The missing gene categories report single-copy genes not present in the assembly, excluding the sex chromosomes. The stability of duplicated and missing genes across all samples supports that Verkko2 is accurately reconstructing the full sequence for both haplotypes.
