Genome Biol. 2023 Sep 11;24(1):205.
doi: 10.1186/s13059-023-03022-8.

PhaseDancer: a novel targeted assembler of segmental duplications unravels the complexity of the human chromosome 2 fusion going from 48 to 46 chromosomes in hominin evolution

Barbara Poszewiecka et al. Genome Biol. 2023.

Abstract

Resolving complex genomic regions rich in segmental duplications (SDs) is challenging due to the high error rate of long-read sequencing. Here, we describe a targeted approach with a novel genome assembler, PhaseDancer, that iteratively extends SD-rich regions of interest. We validate its robustness and efficiency using a gold-standard set of human BAC clones and in silico-generated SDs with predefined evolutionary scenarios. PhaseDancer enables extension of the incomplete, complex SD-rich subtelomeric regions of Great Ape chromosomes orthologous to the human chromosome 2 (HSA2) fusion site, informing a model of HSA2 formation and unravelling the evolution of human and Great Ape genomes.

Keywords: Chromosomal fusion; Complex genomic rearrangements; De-novo assembly; Long-read PacBio sequencing; Segmental duplications.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
A workflow of the PhaseDancer algorithm and the accompanying tools. PhaseDancer works with long-read sequencing data, e.g., Oxford Nanopore or PacBio. Starting with an initial anchor sequence, the core workflow of PhaseDancer iterates over four major steps: (i) mapping the reads onto the anchor sequence, (ii) clustering the mapped reads and selecting the cluster whose reads originate from the genomic region represented by the anchor sequence, (iii) assembling these reads into a contig, and (iv) extending the current anchor sequence with the contig to form a new anchor sequence, which is processed in the next iteration. After all iterations, the algorithm outputs the final assembled sequence. PhaseDancer is accompanied by two supporting tools: PhaseDancerViewer complements the semi-supervised character of PhaseDancer by enabling intermediate control of the assembly process, whereas PhaseDancerSimulator generates in silico data for thorough validation of the algorithm. Thanks to its high efficiency, PhaseDancer can be used for resolving challenging genomic tasks involving segmental duplication (SD) assembly
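The four-step loop described in this caption can be illustrated with a toy sketch. This is not the PhaseDancer implementation: the function names (`best_overlap`, `extend_anchor`), the 10% mismatch tolerance, the "fewest mismatches" clustering rule, and the majority-vote consensus are all assumptions chosen to keep the example small. The simulated data are two paralogous copies that differ at every tenth base.

```python
import random
from collections import Counter


def best_overlap(anchor, read, min_overlap):
    """Longest anchor-suffix/read-prefix overlap with at most ~10% mismatches.

    Returns (overlap_length, mismatches), or None if the read does not map.
    """
    longest = min(len(anchor), len(read) - 1)  # keep at least 1 base of overhang
    for k in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(anchor[-k:], read[:k]))
        if mismatches <= k // 10:
            return k, mismatches
    return None


def extend_anchor(anchor, reads, max_iterations=20, min_overlap=15):
    """Toy version of the iterate-map-cluster-assemble-extend loop."""
    for _ in range(max_iterations):
        # (i) Map: keep reads whose prefix overlaps the end of the anchor.
        hits = []
        for read in reads:
            overlap = best_overlap(anchor, read, min_overlap)
            if overlap is not None:
                hits.append((read, *overlap))
        if not hits:
            break  # nothing left to extend with
        # (ii) Cluster: keep only the reads that agree best with the anchor
        # (assumed stand-in for selecting the cluster from the anchor's copy).
        fewest = min(m for _, _, m in hits)
        cluster = [(read, k) for read, k, m in hits if m == fewest]
        # (iii) Assemble: column-wise majority vote over the read overhangs.
        overhangs = [read[k:] for read, k in cluster]
        contig = "".join(
            Counter(o[i] for o in overhangs if i < len(o)).most_common(1)[0][0]
            for i in range(max(map(len, overhangs)))
        )
        # (iv) Extend: the lengthened sequence is the anchor of the next round.
        anchor += contig
    return anchor


# Hypothetical test data: two 120-bp paralogs differing at every 10th base.
rng = random.Random(0)
copy_a = "".join(rng.choice("ACGT") for _ in range(120))
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
copy_b = "".join(swap[b] if i % 10 == 5 else b for i, b in enumerate(copy_a))
reads = [c[s:s + 40] for c in (copy_a, copy_b) for s in range(0, 81, 5)]
rng.shuffle(reads)

assembled = extend_anchor(copy_a[:20], reads)
print(assembled == copy_a)
```

In this toy setting, reads from the other paralog can still map to the anchor within the mismatch tolerance, and it is the clustering step that keeps them out of the consensus. Real long reads are noisy, so an actual assembler would need an error-tolerant aligner and a statistical clustering method in place of these exact comparisons.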
Fig. 2
An overview of SD characteristics and the study motivation. Based on the most recent T2T human genome assembly: (A) A contour plot of SD abundance given their sequence identity (90–100%, x-axis) and total length (Mb, y-axis, log-scale), where the blue colour intensifies with the increasing number of SDs; (B) A barplot of the total SD length (Mb, log-scale, y-axis) given the total number of SD copies (x-axis) located in interstitial (top, blue) and non-interstitial (bottom, yellow) genomic regions; (C) An area plot of the total SD length (Mb, log-scale, y-axis) for SDs with at least the given number of copies (x-axis) and the minimal percent sequence identity (area colour). Here, the number of stacked SDs per base is the number of reads overlapping a given base position of the reference genome. (D) A normalised depth-of-coverage histogram of the aligned whole-genome circular consensus sequencing (CCS) reads in the human (NA12878), two chimpanzee (Clint, Chaos), bonobo (Mhudilbu), and gorilla (Kamilah) genomic regions syntenic to those flanking the HSA2 fusion site. For bonobo and both chimpanzees, two depth-of-coverage tracks are shown. The top track presents the full scale of all data, whereas the bottom track zooms in on the coverage values, excluding the extremely high coverage region. The red line on each of the top tracks indicates the y-axis limit of the bottom track. Note the high coverage of the ~31 kb fragment previously found to be amplified about 400 times in the chimp genome [19]. (E) Optical genome mapping was used to assess the current incompleteness of the subtelomeric assemblies in chimpanzee and bonobo genomes (panTro5, panTro6, and panPan3). Each of the subtelomeric ends was estimated to lack at least 0.3 Mb of DNA sequence
Fig. 3
Time complexity, feasibility, and correctness of PhaseDancer. (A) Computational time performance (y-axis) for different numbers of stacked SDs (x-axis) and processes (colour scale). Each boxplot represents 100 iterations of PhaseDancer for a given setting. (B) Feasibility space for SDs in human. PhaseDancer resolves all SDs with the number of stacked SDs per base as for SDs identified by the T2T human genome assembly (area plot, Fig. 2C). For a given number of stacked SDs (x-axis), the height of each bar indicates the average runtime of a PhaseDancer iteration (right y-axis), along with the standard deviation (error bars) and individual measurements (points). (C) The evaluation of PhaseDancer assemblies using the Phred Quality Score (Q; y-axis). The samples used for evaluation were generated by PhaseDancerSimulator with fixed parameters: a coverage of 40x, an average read length of 18 kb, and a read length standard deviation of 3 kb. The x-axis represents different sequencing error levels, while the colour scale indicates different numbers of cis-morphisms per 10 kb window. The additional upper panel shows, as bar plots, the percentage of assembly tasks with no errors (Q > 60). Remarkably, our analyses revealed no significant changes in assembly quality for different PhaseDancerSimulator topologies (SD evolutionary scenarios). (D) Correctness of the PhaseDancer assemblies was assessed using optical genome mapping (OGM). All HSA2 syntenic sites of the chimpanzee genome were in concordance with the corresponding OGM molecules (BssSI enzyme shown)
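For readers unfamiliar with the Phred scale used in panel C, the sketch below applies the standard definition Q = -10 log10(p), where p is the per-base error probability. The caption does not say how the authors computed Q for an assembly, so estimating p as mismatched bases divided by assembly length, and the helper names `phred_quality` and `error_probability`, are assumptions made for illustration.

```python
import math


def phred_quality(n_errors, n_bases):
    """Phred score Q = -10 * log10(p), with p estimated as n_errors / n_bases.

    An error-free assembly has p = 0, so Q is reported as infinity.
    """
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    if n_errors == 0:
        return math.inf
    return -10.0 * math.log10(n_errors / n_bases)


def error_probability(q):
    """Inverse mapping: per-base error probability for a given Phred score."""
    return 10.0 ** (-q / 10.0)


print(phred_quality(1, 1_000))       # one error per kilobase
print(phred_quality(0, 1_000_000))   # error-free assembly
print(error_probability(60))         # the Q = 60 threshold from the caption
```

Under this definition, Q = 60 corresponds to one error per million bases, which is consistent with the caption treating Q > 60 as error-free for assemblies far shorter than 1 Mb.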
Fig. 4
Genome architecture flanking the HSA2 fusion site and the syntenic genomic regions in Great Apes and human. From the top, the figure depicts the sequences from: orangutan (PAB) and gorilla (GGO) chromosomes 2Apter and 2Bpter; chimpanzee (PTR) and bonobo (PPA) chromosomes 2Apter, 2Bpter, 9pter, 12pter and 22qter; and human HSA2, all together with the corresponding coding regions track. Each individual contig is represented by a uniquely coloured stripe that is consistent among species/chromosomes, labelled with its coordinates in the human genome build (hg38), and marked with arrowheads indicating the DNA strand. Dark grey contigs with white crosses depict strongly mosaic SDs or tandem repeats that cannot be presented graphically in a legible way. Brown arrowheads depict the TAR1 satellite and degenerate telomeric repeats at the HSA2 fusion site and their orthologs in Great Apes. Below each contig assembly, a coloured stripe depicts: (i) green - the novel reconstructed assembly along with its approximate size, (ii) pink - the high homology region between chromosomes 2Apter and 2Bpter presumably triggering the fusion event, and (iii) light blue - the region that was lost after the fusion event with respect to HSA2. HSA2 is also shown with a track of collapsed SDs, including a ~190 kb fragment homologous to HSA9pter and three fragments, ~68 kb in total, homologous to HSA22pter. The azure contig (chr2:113,523-113,554 kb) was found to be amplified ~400 times in the chimpanzee genome [19]
Fig. 5
The proposed model for the evolutionary HSA2 fusion event based on the assembled SD-rich subtelomeric sequences in Great Ape chromosomes, absent in the reference genomes. The fusion site is flanked proximally and distally, respectively, by the ~190 kb and ~68 kb SDs homologous to human chromosomes 9p24.3 and 22q13.33 (98.9% and 97.8–99.1% sequence identity). The ~190 kb fragment harbouring FOXD4L1 (red solid rectangle) (Fig. 4), and likely originating from an ancestral locus syntenic to chromosome 9q21.11 in human, was previously shown to be duplicatively transposed to chromosome PTR2Apter after gorilla had branched off the common chimp-human ancestor lineage (Additional file 1: Tables S3–S5) [20, 36, 46, 47]. Both copies flank the evolutionarily pericentromeric inversion in the human and chimp genomes that arose after the gorilla divergence [36, 45, 47]. We have proposed that a portion of the PTR9pter copy was also copied onto chromosome PTR22qter and later PTR2Bter before the gorilla-chimp divergence [36, 45, 48, 49]. Importantly, our assemblies revealed substantially long homology (~190 kb) between the lost fragments (within the yellow band) of the ancestral chromosomes 2Apter (Pre HSA2A) and 2Bpter (Pre HSA2B) that might have served as a substrate for misalignment during meiosis. The fusion occurred within the TAR1 satellite and degenerate telomeric repeats present in both Pre HSA2Apter and Pre HSA2Bpter. Submicroscopic subtelomeric rearrangements in human are a relatively common cause of genomic imbalances in patients with developmental delay/intellectual disability [50]. Analyses of these sequences showed that two copies of the following six protein-coding genes, FOXD4L1, JMJD7-PLA2G4B, MAPKBP1, SPTBN5, CBWD2, and MALRD1, one pseudogene, PGM5P4, and three lncRNAs, LINC01881, LINC01961, and PGM5P4-AS1, might have been lost during the fusion event (Fig. 4, Additional file 1: Fig. S9)

References

    1. Huddleston J, Ranade S, Malig M, Antonacci F, Chaisson M, Hon L, et al. Reconstructing complex regions of genomes using long-read sequencing technology. Genome Res. 2014;24(4):688–696. doi: 10.1101/gr.168450.113.
    2. Sohn J-I, Nam J-W. The present and future of de novo whole-genome assembly. Brief Bioinforma. 2016;096. doi: 10.1093/bib/bbw096.
    3. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020;21(1). doi: 10.1186/s13059-020-1935-5.
    4. Marques-Bonet T, Girirajan S, Eichler EE. The origins and impact of primate segmental duplications. Trends Genet. 2009;25(10):443–454. doi: 10.1016/j.tig.2009.08.002.
    5. Stankiewicz P, Shaw CJ, Withers M, Inoue K, Lupski JR. Serial segmental duplications during primate evolution result in complex human genome architecture. Genome Res. 2004;14(11):2209–2220. doi: 10.1101/gr.2746604.
