G3 (Bethesda). 2022 Jul 29;12(8):jkac143.
doi: 10.1093/g3journal/jkac143.

Assembly of complete diploid-phased chromosomes from draft genome sequences

Andrea Minio et al. G3 (Bethesda).

Abstract

De novo genome assembly is essential for genomic research. High-quality genomes assembled into phased pseudomolecules are challenging to produce and often contain assembly errors because of repeats, heterozygosity, or the chosen assembly strategy. Although algorithms that produce partially phased assemblies exist, haploid draft assemblies that may lack biological information remain favored because they are easier to generate and use. We developed HaploSync, a suite of tools that produces fully phased, chromosome-scale diploid genome assemblies and performs extensive quality control to limit assembly artifacts. HaploSync scaffolds sequences from a draft diploid assembly into phased pseudomolecules guided by a genetic map and/or the genome of a closely related species. HaploSync generates a report that visualizes the relationships between current and legacy sequences, for both haplotypes, and displays their gene and marker content. This quality control helps the user identify misassemblies and guides HaploSync's correction of scaffolding errors. Finally, HaploSync fills assembly gaps with unplaced sequences and resolves collapsed homozygous regions. In a series of plant, fungal, and animal kingdom case studies, we demonstrate that HaploSync efficiently increases the assembly contiguity of phased chromosomes, improves completeness by filling gaps, corrects scaffolding, and correctly phases highly heterozygous, complex regions.

Keywords: assembly error correction; chromosome anchoring; diploid genomes; haplotype phasing; hybrid genome assembly.

PubMed Disclaimer

Figures

Fig. 1.
The HaploSync pipeline builds and refines haploid and diploid genome assemblies. The diploid-aware pipeline can deliver fully phased diploid pseudomolecules using a draft diploid assembly or diploid pseudomolecules as input. If draft sequences are used, HaploSplit first separates the haplotypes into 2 pseudomolecule sets. Pseudomolecules provided by the user or reconstructed with HaploSplit then undergo quality control with HaploDup. If errors are found, input sequences can be edited with HaploBreak prior to rebuilding the pseudomolecules with HaploSplit. If no errors are detected and there are unplaced sequences, the pseudomolecules undergo gap-filling with HaploFill. After each filling iteration, quality control can be performed with HaploDup. Finally, HaploMap can be used to identify colinear regions between pseudomolecules.
Fig. 2.
The HaploSplit procedure using genetic markers as input. a) The procedure identifies marker positions in the draft sequences. b) The longest sorted set of markers is identified for each draft sequence. c) Each sequence is assigned to a unique genomic region in the map (linkage group) and oriented. d) A directed adjacency network of nonoverlapping sequences is built for each linkage group connecting all sequences with no overlapping ranges of genetic markers. Sequences sharing markers are placed in separate network paths. e) The tiling path that maximizes the number of covered markers is selected for the first haplotype. f) Sequences belonging to the first haplotype are removed from the adjacency network and the second-best tiling path is used to scaffold the second haplotype.
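Steps b) and d) to f) of the Fig. 2 caption can be illustrated with a small sketch. This is not HaploSync's code: the `Seq` record, the function names, and the toy coordinates are all hypothetical. It assumes each draft sequence has already been reduced to the map range spanned by its sorted markers. It treats the "longest sorted set" as a longest increasing subsequence and the tiling path as the highest-scoring chain of nonoverlapping ranges.

```python
from bisect import bisect_left
from dataclasses import dataclass


def longest_sorted_markers(positions):
    """Step b (assumed reading): longest strictly increasing run of map positions."""
    tails, tail_idx, prev = [], [], [None] * len(positions)
    for i, p in enumerate(positions):
        k = bisect_left(tails, p)
        if k == len(tails):
            tails.append(p)
            tail_idx.append(i)
        else:
            tails[k] = p
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(positions[i])
        i = prev[i]
    return out[::-1]


@dataclass(frozen=True)
class Seq:
    """Hypothetical record for one oriented draft sequence in a linkage group."""
    name: str
    start: float  # map position of its first sorted marker
    end: float    # map position of its last sorted marker
    markers: int  # number of markers in its sorted set


def best_tiling_path(seqs):
    """Steps d-e: chain of nonoverlapping sequences covering the most markers."""
    # Sorting by start is a valid topological order, because an edge a -> b
    # requires a.end < b.start and therefore a.start < b.start.
    order = sorted(seqs, key=lambda s: (s.start, s.end))
    if not order:
        return []
    best = [s.markers for s in order]
    prev = [None] * len(order)
    for j, b in enumerate(order):
        for i in range(j):
            a = order[i]
            if a.end < b.start and best[i] + b.markers > best[j]:
                best[j] = best[i] + b.markers
                prev[j] = i
    j = max(range(len(order)), key=best.__getitem__)
    path = []
    while j is not None:
        path.append(order[j])
        j = prev[j]
    return path[::-1]


def split_haplotypes(seqs):
    """Step f: remove haplotype 1 sequences, then tile the remainder."""
    hap1 = best_tiling_path(seqs)
    rest = [s for s in seqs if s not in hap1]
    return hap1, best_tiling_path(rest)


# Toy linkage group: two primary contigs and their overlapping haplotigs.
toy = [Seq("p1", 0, 40, 8), Seq("h1", 5, 35, 6),
       Seq("p2", 45, 90, 9), Seq("h2", 50, 85, 7)]
hap1, hap2 = split_haplotypes(toy)
print([s.name for s in hap1], [s.name for s in hap2])
```

In the toy input, each haplotig shares a marker range with its primary contig, so the two can never sit on the same path. The quadratic dynamic program is chosen for readability. Choosing an orientation could be approximated by running `longest_sorted_markers` on both the forward and the reversed marker order and keeping the longer result, but that too is an assumption about the procedure.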
Fig. 3.
Example of HaploDup’s interactive reports. The figure shows 2 static screenshots exemplifying HaploDup's interactive output. a) Assembly quality control of M. rotundifolia chromosome 12 Haplotype 1: whole-sequence alignment of both alternative haplotypes on Haplotype 1, legacy contig and hybrid scaffold composition of Haplotype 1, position of the genetic markers and the duplicated markers in Haplotype 1, and number of significant alignments per gene of Haplotype 1 in each alternative haplotype. In this example, the legacy contig composition and the position of duplicated markers indicate that both alleles (primary contig and haplotig) and both marker copies were placed in a hybrid scaffold (overlaid box). b) Unplaced sequence quality control: marker content is compared between pseudomolecules and unplaced sequences to evaluate conditions that prevent the inclusion of a specific unplaced sequence. Markers are color-coded based on their order in the map. The structure of pseudomolecules and unplaced sequences is represented with color-coded blocks. Blocks identify the composition in terms of draft assembly sequences, and their colors show the relationships between the composing sequences (e.g. primary to haplotig relationships). In this example, the presence of a marker (overlaid box, the dark marker on the right of the contig) in the unplaced sequence far from its expected position on the map extends the expected coverage of the map to the end of the linkage group and prevents placement in any haplotype scaffold.
Fig. 4.
HaploSplit performance. a) The results of using different sources of external information and HaploSplit protocols for V. vinifera cv. Cabernet Franc cl. 04 (Vondras et al. 2021) assembly. Map-based assembly produces the largest first haplotype, but its overassembly occurs at the expense of the second haplotype’s completeness. A map-based approach is conservative and limited by the density of the markers. The hybrid approach recovers more sequences where the map is lacking information, without overassembling, and delivers a better reconstruction of both haplotypes. b) Effect of limited marker availability on overall assembly length tested on B. taurus Angus × Brahma (Koren et al. 2018; Low et al. 2020) by subsampling the genetic map. Longer sequences are more likely to contain a marker, making the first reconstructed haplotype most complete across all tests and with little variation in size. As the number of available markers increases and short sequences are included, the completeness of the second haplotype improves. c) Effect of limited marker availability on the number of placed sequences tested on B. taurus Angus × Brahma (Koren et al. 2018; Low et al. 2020) by subsampling the genetic map. Increasing the number of markers as fragmentation increases allows recruiting more sequences for scaffolding and improves completeness. Haplotype 1, with long sequences, shows little variation. In contrast, Haplotype 2 greatly benefits from increased marker density. The majority of sequences that remained unplaced are short and a small fraction of the genome’s length.

