Genome Res. 2024 Nov 20;34(11):1908-1918.
doi: 10.1101/gr.279311.124.

Telomere-to-telomere assembly by preserving contained reads

Sudhanva Shyam Kamath et al. Genome Res. 2024.

Abstract

Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. The string graph is a commonly used assembly graph representation in assembly algorithms. The string graph formulation employs graph simplification heuristics that drastically reduce the number of vertices and edges. One of these heuristics removes reads that are contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (1) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore Technologies (ONT) reads than in Pacific Biosciences high-fidelity (PacBio HiFi) reads because of differences in their read-length distributions, and (2) this frequency decreases as the sequencing depth increases. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the repeat-aware fragmenting tool (RAFT) assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. Using simulated data sets, we empirically demonstrate that RAFT significantly reduces the number of gaps. Using real ONT and PacBio HiFi data sets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to hifiasm.

Figures

Figure 1.
Assembly gaps and their occurrence frequency. (A) An example of a sequencing output where an assembly gap occurs in the string graph due to contained read deletion. Read r3 is contained in read r1. Read r8 is contained in read r7. Accordingly, the string graph representation excludes reads r3 and r8. Read r3 is redundant; its deletion simplifies the graph. However, removing read r8 breaks the connectivity between reads r5 and r9, which was necessary to spell the second haplotype. (B) Fraction of sequencing outputs containing an assembly gap. We measured the fractions using the read-length distributions corresponding to three sequencing technologies (PacBio HiFi, ONT Duplex, ONT Simplex) and using different sequencing depths. Here, we used equal sequencing depths on both haplotypes. (C,D) Fraction of sequencing outputs containing an assembly gap when the sequencing depths across the two haplotypes are uneven. This scenario models somatic mutation in DNA with variant allele frequency below 0.5. In (C), the total sequencing depth for both haplotypes is 50×. In (D), the total sequencing depth is 100×.
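
The fractions in panel B come from the mathematical derivation in the paper. As a rough illustration only, the same kind of quantity can be estimated by simulation under a simplified model: one heterozygous site in the middle of a window, reads sampled uniformly on two haplotypes, and a read treated as contained when its interval lies inside another read whose sequence would be identical over that interval. The two read-length samplers below (`narrow_lengths`, `wide_lengths`) and every numeric parameter are assumptions made for this sketch; they are not the distributions or settings used in the study.

```python
import random

def sample_reads(rng, hap, depth, window, draw_length, max_len):
    """Draw (hap, start, end) reads until roughly `depth`x coverage is reached."""
    reads, bases = [], 0
    target = depth * (window + max_len)  # reads may start left of the window
    while bases < target:
        length = min(draw_length(rng), max_len)
        start = rng.randint(-max_len, window - 1)
        reads.append((hap, start, start + length))
        bases += length
    return reads

def drop_contained(reads, het):
    """Remove reads whose interval lies inside another read with the same sequence.

    A read that spans the heterozygous site can only be contained in a read
    from its own haplotype; any other read can be contained in any read.
    """
    kept, best_any, best_hap = [], float("-inf"), {}
    for hap, start, end in sorted(reads, key=lambda r: (r[1], -r[2])):
        spans = start <= het < end
        limit = best_hap.get(hap, float("-inf")) if spans else best_any
        if limit < end:  # no earlier-starting compatible read reaches this far
            kept.append((hap, start, end))
        best_any = max(best_any, end)
        best_hap[hap] = max(best_hap.get(hap, float("-inf")), end)
    return kept

def has_gap(reads, hap, het, window):
    """True if reads consistent with `hap` fail to chain across [0, window)."""
    usable = sorted((s, e) for h, s, e in reads
                    if e > 0 and (h == hap or not (s <= het < e)))
    reach = 0
    for i, (s, e) in enumerate(usable):
        # A zero-length overlap between consecutive reads counts as a break.
        if (i == 0 and s > 0) or (i > 0 and s >= reach):
            return True
        reach = max(reach, e)
        if reach >= window:
            return False
    return True

def gap_fraction(draw_length, depth_per_hap, trials=200,
                 window=100_000, max_len=100_000, seed=0):
    """Fraction of simulated outputs where contained-read deletion creates a gap."""
    rng, het, gaps = random.Random(seed), window // 2, 0
    for _ in range(trials):
        reads = [r for hap in (1, 2)
                 for r in sample_reads(rng, hap, depth_per_hap, window,
                                       draw_length, max_len)]
        kept = drop_contained(reads, het)
        if any(not has_gap(reads, h, het, window) and has_gap(kept, h, het, window)
               for h in (1, 2)):
            gaps += 1
    return gaps / trials

def narrow_lengths(rng):  # assumption: tight, HiFi-like length distribution
    return max(1_000, int(rng.gauss(18_000, 2_000)))

def wide_lengths(rng):    # assumption: long-tailed length distribution
    return 1_000 + int(rng.expovariate(1 / 20_000))

if __name__ == "__main__":
    for name, dist in (("narrow", narrow_lengths), ("wide", wide_lengths)):
        for depth in (10, 20):
            print(name, depth, gap_fraction(dist, depth))
```

Only outputs that are gap-free before deletion and broken afterwards are counted, so shortfalls in raw coverage are not attributed to the heuristic. Uneven depths, as in panels C and D, could be explored by passing different depths for the two haplotypes to `sample_reads`.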
Figure 2.
Illustration of the RAFT algorithm and its usage for genome assembly. (A) Flowchart of an assembly workflow that uses RAFT. RAFT accepts error-corrected long reads and all-to-all alignment information as input. It produces a revised set of fragmented reads with a narrow read-length distribution. (B) Illustration of the RAFT algorithm. Read A (shown in red) is sampled from a nonrepetitive region of the genome. Accordingly, RAFT fragments read A into shorter uniform-length reads. Read B (shown in pink) spans a repetitive region of the genome. RAFT detects the repetitive interval in read B because more than the expected number of sequences align to that interval. The portions of read B outside the repetitive interval are split into shorter reads. (C) The impact of RAFT can be seen on a set of ONT Duplex reads sampled from the HG002 human genome. The range of the read lengths is significantly reduced by using RAFT. The original data set comprises 3.7 million reads with a skewed read-length distribution. After fragmentation, the data set comprises 6.8 million reads.
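
The idea in panel B can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the RAFT implementation: `repeat_intervals` flags stretches of a read where the alignment depth exceeds a multiple of the expected depth, and `fragment` cuts the read into overlapping pieces of about `frag_len` bases while never cutting inside a flagged stretch. The function names, the `factor` threshold, the `overlap` parameter, and its reuse as a flank around each repeat are all hypothetical choices made for this example.

```python
def repeat_intervals(read_len, alignments, expected_depth, factor=1.5):
    """Return maximal [start, end) stretches with depth > factor * expected_depth.

    `alignments` is a list of (start, end) intervals on the read, one per
    sequence aligned to it.
    """
    diff = [0] * (read_len + 1)  # difference array for alignment depth
    for s, e in alignments:
        s, e = max(0, s), min(read_len, e)
        if s < e:
            diff[s] += 1
            diff[e] -= 1
    intervals, depth, start = [], 0, None
    for pos in range(read_len):
        depth += diff[pos]
        high = depth > factor * expected_depth
        if high and start is None:
            start = pos
        elif not high and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, read_len))
    return intervals

def fragment(read_len, repeats, frag_len, overlap):
    """Split [0, read_len) into overlapping fragments without cutting a repeat.

    When a cut would land inside a repeat (or within `overlap` bases of it),
    the fragment is extended past the repeat, so the repeat and its flanks
    stay together in one fragment.
    """
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must be non-negative and smaller than frag_len")
    fragments, pos = [], 0
    while True:
        end = min(pos + frag_len, read_len)
        for rs, re_ in sorted(repeats):
            if rs - overlap < end < re_ + overlap:
                end = min(re_ + overlap, read_len)
        fragments.append((pos, end))
        if end >= read_len:
            return fragments
        pos = end - overlap  # consecutive fragments share `overlap` bases

if __name__ == "__main__":
    # Three alignments cover the whole read; six more pile up on [40, 70).
    alignments = [(0, 100)] * 3 + [(40, 70)] * 6
    repeats = repeat_intervals(100, alignments, expected_depth=3)
    print(repeats)                        # -> [(40, 70)]
    print(fragment(100, repeats, 30, 5))  # -> [(0, 30), (25, 75), (70, 100)]
    print(fragment(100, [], 30, 5))       # -> [(0, 30), (25, 55), (50, 80), (75, 100)]
```

A read without a flagged interval, like read A in the figure, is cut into near-uniform pieces, while a read like read B yields one longer fragment that carries the repeat intact. The sketch makes no attempt to avoid a very short final fragment.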
Figure 3.
Illustration of conditions that lead to an assembly gap due to contained read deletion. (A) An example of a sequencing output that is affected by the deletion of contained reads r6 and r7. Removing contained reads r6 and r7 introduces an assembly gap on haplotype 2. (B) An example of a sequencing output where contained read deletion does not introduce an assembly gap. Read r6 supports the sampling interval of contained read r7 after its deletion.
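
The two situations can be reproduced with a toy example. The sketch below uses hypothetical read names and coordinates (they do not reproduce the reads drawn in the figure) on a 100-base region with a single heterozygous site at position 50. A read counts as contained when its interval lies inside another read and the two would carry the same sequence, that is, when both come from the same haplotype or the shorter read does not span the heterozygous site.

```python
from dataclasses import dataclass

HET = 50      # position of the heterozygous site (hypothetical)
LENGTH = 100  # length of the region to be assembled (hypothetical)

@dataclass(frozen=True)
class Read:
    name: str
    hap: int    # haplotype the read was sampled from (1 or 2)
    start: int  # half-open interval [start, end)
    end: int

def spans_het(r):
    return r.start <= HET < r.end

def contained_in(a, b):
    """True if read `a` is a substring of read `b` in this model."""
    if a == b or not (b.start <= a.start and a.end <= b.end):
        return False
    if a.hap != b.hap and spans_het(a):
        return False                  # alleles differ, sequences differ
    if (a.start, a.end) == (b.start, b.end):
        return b.name < a.name        # keep one of two identical reads
    return True

def remove_contained(reads):
    return [r for r in reads if not any(contained_in(r, o) for o in reads)]

def has_gap(reads, hap):
    """True if reads consistent with `hap` cannot be chained over [0, LENGTH)."""
    usable = sorted((r for r in reads if r.hap == hap or not spans_het(r)),
                    key=lambda r: r.start)
    reach = 0
    for i, r in enumerate(usable):
        if (i == 0 and r.start > 0) or (i > 0 and r.start >= reach):
            return True
        reach = max(reach, r.end)
    return reach < LENGTH

# Scenario A: read f (haplotype 2) is contained in read b (haplotype 1).
scenario_a = [
    Read("a", 1, 0, 40), Read("b", 1, 35, 90), Read("c", 1, 85, 100),
    Read("d", 2, 0, 30), Read("e", 2, 20, 60), Read("f", 2, 55, 85),
    Read("g", 2, 80, 100),
]
# Scenario B: an extra haplotype-2 read h covers the interval of f and is
# itself not contained, because it spans the heterozygous site.
scenario_b = scenario_a + [Read("h", 2, 45, 88)]

if __name__ == "__main__":
    for label, reads in (("A", scenario_a), ("B", scenario_b)):
        kept = remove_contained(reads)
        print(label, [r.name for r in kept], "gap on haplotype 2:", has_gap(kept, 2))
```

In scenario A, deleting `f` leaves haplotype 2 with no read that connects `e` to `g`, because `b` carries the other allele. In scenario B, `h` survives the deletion and still bridges that interval, so no gap appears.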
