[Preprint]. 2023 Aug 15:arXiv:2308.07877v1.

Genome assembly in the telomere-to-telomere era

Heng Li et al. ArXiv.

Abstract

De novo assembly is the process of reconstructing the genome sequence of an organism from sequencing reads. Genome sequences are essential to biology, and assembly has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but technological advances in long-read sequencing now enable near-complete, chromosome-level assembly, also known as telomere-to-telomere assembly, for many organisms. Here we review recent progress on assembly algorithms and protocols. We focus on how to derive near telomere-to-telomere assemblies and discuss potential future developments.

Conflict of interest statement

Competing interests statement: The authors declare no competing interests.

Figures

Figure 1. Strategy for near telomere-to-telomere assembly.
a, Assembling a haploid or homozygous genome. After sequencing errors on accurate long reads are corrected, error-free reads are assembled into an initial assembly graph, where a thick arrow denotes a sequence, and a thin line connects sequences. Ultra-long reads are then threaded through the assembly graph to resolve tangled subgraphs and patch small assembly gaps. Long-range data such as Hi-C helps to scaffold across remaining gaps. b, Assembling a heterozygous diploid genome. Heterozygous differences between haplotypes are preserved during error correction. The assembly graphs often consist of a chain of “bubbles”, representing polymorphisms between haplotypes. Ultra-long reads and long-range data can be used to phase haplotypes as well as resolve tangles.
Figure 2. Types of phased assembly of diploid samples.
a, The assembly graph from Fig. 1b can be further processed into different types of assemblies. b, Primary/alternate assembly pair. The primary assembly represents a complete haploid genome with occasional phase switches. The alternate assembly is fragmented. c, A pair of dual assemblies. Each dual assembly is similar to a primary assembly. d, A pair of chromosome-phased assemblies. Contigs from the same haploid chromosome are partitioned to the same assembly. e, A pair of chromosome-phased assemblies with scaffolding. Contigs are joined into chromosomes across assembly gaps.
Figure 3. Assembly with overlap graphs.
a, Simple overlap graph assembly. Find overlaps between all reads, identify transitive overlaps (dashed arrows) that can be inferred from other overlaps, remove transitive overlaps, and merge vertices with one incoming edge and one outgoing edge to get the final unitigs. b, Graph cleaning. An uncorrected sequencing error (yellow hexagon) may lead to a tip (read 3) that should be trimmed off. Repeats (red regions) may result in overlaps between repeat copies that can be cut with graph cleaning. c, Assembling a tandem duplication longer than reads. Disallowing inexact overlaps (red arrows) resolves the region into a simple graph. d, Assembling a diploid sample. Allowing inexact overlaps leads to the loss of heterozygous differences and collapses the two haplotypes. Using only exact overlaps eliminates alignments between haplotypes and thus preserves the heterozygous alleles and their local phasing. e, Removing contained reads (yellow lines) leads to assembly gaps on the red haplotypes.
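
The steps in panel a can be illustrated with a minimal Python sketch. It is not the implementation of any assembler discussed in the review: it assumes error-free, single-strand reads, uses only exact suffix-prefix overlaps, applies a simplified transitive-reduction rule that ignores overlap lengths, and skips graph cleaning and purely circular graphs. The function names, the toy genome, and the `min_len` threshold are all hypothetical choices made for illustration.

```python
from itertools import permutations


def overlap_len(a, b, min_len):
    """Longest exact suffix of `a` that is a proper prefix of `b`, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len):
    """All-vs-all exact overlaps: graph[i][j] = overlap length between read i and read j."""
    graph = {i: {} for i in range(len(reads))}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap_len(reads[i], reads[j], min_len)
        if n:
            graph[i][j] = n
    return graph


def remove_transitive(graph):
    """Drop edge u->w whenever u->v and v->w both exist (simplified rule)."""
    reduced = {u: dict(vs) for u, vs in graph.items()}
    for u in graph:
        for v in graph[u]:
            for w in graph[v]:
                if w != v and w in graph[u]:
                    reduced[u].pop(w, None)
    return reduced


def merge_unitigs(reads, graph):
    """Merge chains where a vertex has one outgoing edge and its successor has one incoming edge."""
    indeg = {u: 0 for u in graph}
    for vs in graph.values():
        for v in vs:
            indeg[v] += 1
    nxt = {}
    for u, vs in graph.items():
        if len(vs) == 1:
            (v, n), = vs.items()
            if indeg[v] == 1:
                nxt[u] = (v, n)
    targets = {v for v, _ in nxt.values()}
    unitigs = []
    for start in graph:
        if start in targets:  # interior of a chain, reached from its start
            continue
        seq, u = reads[start], start
        while u in nxt:
            v, n = nxt[u]
            seq += reads[v][n:]  # append only the non-overlapping part
            u = v
        unitigs.append(seq)
    return unitigs


# Toy data (hypothetical): 8 bp reads sampled every 2 bp from a repeat-free 16 bp string.
genome = "ATGCGTACCTGAAGTC"
reads = [genome[i:i + 8] for i in range(0, 9, 2)]
graph = remove_transitive(build_overlap_graph(reads, min_len=4))
contigs = merge_unitigs(reads, graph)
print(contigs)
```

On this toy input, each read overlaps the next by 6 bp and the one after that by 4 bp; the 4 bp overlaps are transitive and are removed, leaving a single chain that merges back into the original string.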
Figure 4. De Bruijn graphs.
a, Node (vertex)-centric de Bruijn graphs of a string of different k-mer lengths. b, Multiplex DBG improves assembly. The compacted de Bruijn graph using 6-mers as nodes, DBGv(6), is fragmented into two unitigs. DBGv(5) has one connected component but the graph has a cycle. A multiplex de Bruijn graph, DBGv(5,6), is conceptually constructed from the combined set of unitigs in DBGv(5) and DBGv(6), using 6-mers as nodes. c, However, multiplex DBG does not resolve all cases. In this case, the multiplex DBG is still fragmented, while an overlap-based method (requiring ≥4bp overlaps) assembles to a single contig (as in b).
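
The trade-off between k-mer lengths described in panels a and b can be reproduced on a toy example. The Python sketch below builds a node-centric de Bruijn graph, where nodes are the k-mers present in the reads and an edge joins two nodes that overlap by k-1 bases, then reports connectivity and cyclicity. It is an illustration only: the reads are hypothetical and are not the strings shown in the figure, reverse complements are ignored, and neither unitig compaction nor the multiplex construction is implemented.

```python
def kmers(seq, k):
    """All k-mers of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def node_centric_dbg(reads, k):
    """Nodes are the k-mers seen in the reads; u->v if the (k-1)-suffix of u is the (k-1)-prefix of v."""
    nodes = {km for r in reads for km in kmers(r, k)}
    return {u: [u[1:] + c for c in "ACGT" if u[1:] + c in nodes] for u in nodes}


def count_components(graph):
    """Number of weakly connected components."""
    undirected = {u: set() for u in graph}
    for u, vs in graph.items():
        for v in vs:
            undirected[u].add(v)
            undirected[v].add(u)
    seen, n = set(), 0
    for start in undirected:
        if start in seen:
            continue
        n += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return n


def has_cycle(graph):
    """Kahn-style check: a directed cycle exists if some nodes can never reach in-degree 0."""
    indeg = {u: 0 for u in graph}
    for vs in graph.values():
        for v in vs:
            indeg[v] += 1
    ready = [u for u, d in indeg.items() if d == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in graph[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return removed < len(graph)


# Toy data (hypothetical): two reads overlapping by 4 bp, with the 4-mer TTCA occurring twice.
reads = ["ACGTTCAGGA", "AGGATTCATCCTA"]
for k in (5, 6):
    g = node_centric_dbg(reads, k)
    print(k, len(g), count_components(g), has_cycle(g))
```

With these two reads, the 6-mer spanning the read junction is absent, so the k=6 graph splits into two acyclic pieces, whereas the k=5 graph stays connected but gains a cycle through the repeated 4-mer TTCA. This mirrors the qualitative behavior of DBGv(6) and DBGv(5) in panel b.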
