IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):365-372.
doi: 10.1109/TCBB.2019.2913845. Epub 2021 Feb 3.

Analysis of Subtelomeric REXTAL Assemblies Using QUAST

Tunazzina Islam et al. IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb.

Abstract

Genomic regions of high segmental duplication content and/or structural variation have led to gaps and misassemblies in the human reference sequence, and are refractory to assembly from whole-genome short-read datasets. Human subtelomere regions are highly enriched in both segmental duplication content and structural variations, and as a consequence are both impossible to assemble accurately and highly variable from individual to individual. Recently, we developed a pipeline for improved region-specific assembly called Regional Extension of Assemblies Using Linked-Reads (REXTAL). In this study, we evaluate REXTAL and genome-wide assembly (Supernova) approaches on 10X Genomics linked-reads data sets partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method. Our results describe the accuracy and relative performance of these two approaches using the reference-based assessment module of QUAST. We show that REXTAL dramatically outperforms the Supernova whole-genome assembler in subtelomeric segmental duplication regions and results in highly accurate assemblies. Nearly all of the REXTAL "misassemblies" identified using default QUAST parameters simply pinpoint locations of tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by more than 1000 bp.
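The last point of the abstract can be illustrated with a minimal sketch. It assumes a simplified model in which a breakpoint is flagged whenever the tandem repeat array length in the reference and in the assembly differ by more than the 1000 bp default; the function names and the example array lengths are hypothetical, and QUAST's actual misassembly logic involves more cases than this.

```python
# Simplified model (an assumption, not QUAST's implementation): a breakpoint is
# reported as a "misassembly" when the reference and assembly disagree on the
# length of a tandem repeat array by more than a fixed threshold.

DEFAULT_THRESHOLD_BP = 1000  # default cited in the text


def length_difference(ref_array_bp: int, asm_array_bp: int) -> int:
    """Absolute difference in tandem repeat array length, in bp."""
    return abs(ref_array_bp - asm_array_bp)


def is_flagged(ref_array_bp: int, asm_array_bp: int,
               threshold_bp: int = DEFAULT_THRESHOLD_BP) -> bool:
    """True if the length difference exceeds the threshold (strict inequality assumed)."""
    return length_difference(ref_array_bp, asm_array_bp) > threshold_bp


if __name__ == "__main__":
    # Hypothetical (reference, assembly) array lengths in bp.
    examples = [(5000, 4200), (5000, 3488)]
    for ref_bp, asm_bp in examples:
        diff = length_difference(ref_bp, asm_bp)
        print(f"ref={ref_bp} asm={asm_bp} diff={diff} flagged={is_flagged(ref_bp, asm_bp)}")
```

Under this model, an otherwise correct contig is flagged purely because an individual's repeat array is shorter or longer than the reference copy, which matches the interpretation given in the abstract.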

Figures

Fig. 1.
Overview of REXTAL workflow.
Fig. 2.
Icarus contig alignment viewer for the segmental duplication regions of 18p and 22q. Each viewer has two parts of three rows each. The green bars in the top row of the top part represent the masked reference; white breaks in this track show the positions and sizes of tandem repeats in the reference. The second and third rows show the REXTAL and genome-wide assemblies, respectively. The bottom part shows the assembly overview, with a highlighted yellow box indicating the region expanded in the top three rows. A. Contigs generated by REXTAL (second row) and by the genome-wide method (third row) for 18p. B. Contigs generated by REXTAL (second row) and by the genome-wide method (third row) for 22q.
Fig. 3.
Icarus contig alignment viewer for the segmental duplication region of 16q_2nd. The green bars in the top row of the top part represent the masked reference; white breaks in this track show the positions and sizes of tandem repeats in the reference. The second row shows the contigs generated by REXTAL; the two red blocks represent the misassembled contig, which has a gap of 1512 bp. The third row, reserved for contigs generated by the genome-wide method for the segmental duplication region of 16q_2nd, is empty because the genome-wide method could not extend the assembly to this point. The bottom three rows show the assembly overview, with a highlighted yellow box indicating the region expanded in the top three rows. Note that the “misassembled contig” in fact contains a gap that corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because the gap exceeds the 1000 bp default when aligned to the reference sequence.
Fig. 4.
Icarus contig alignment viewer for the 1-copy regions of 19q and 17q. Each viewer has two parts of three rows each. The green bars in the top row of the top part represent the masked reference; white breaks in this track show the positions and sizes of tandem repeats in the reference. The second and third rows show the REXTAL and genome-wide assemblies, respectively. The bottom part shows the assembly overview, with a highlighted yellow box indicating the region expanded in the top three rows. A. The second row, including the expansion (yellow area), shows the contigs generated by REXTAL, and the third row shows the contigs generated by the genome-wide method for 19q. The expanded second row shows that the misassembled contig has seven blocks, two of which (red blocks) are misassembled because of a relocation with inconsistency = 1115. This misassembled contig lies entirely within another, higher-quality contig (one green block in the second row). B. The second row shows the contigs generated by REXTAL, and the third row shows the contigs generated by the genome-wide method for 17q. The misassembled contig has four blocks (in the assembly overview, a light yellow rectangle marks the selected region and four down arrows (↓) mark the four blocks of the contig). Two of these blocks (red blocks) are misassembled, with inconsistency = 1168. These two misassembled blocks belong to one contig in the REXTAL assembly but to two different contigs in the genome-wide assembly. In the selected region, the genome-wide method has seven separately assembled contigs, whereas REXTAL has one contig with four blocks separated by gaps. Note that the “misassembled contig” in fact contains a gap that corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because the gap exceeds the 1000 bp default when aligned to the reference sequence.
Fig. 5.
Icarus contig alignment viewer for extension of the bait segment into adjacent DNA, including the segmental duplication region, for 17p and 2p. Each viewer has two parts of three rows each. The green bars in the top row of the top part represent the masked reference; white breaks in this track show the positions and sizes of tandem repeats in the reference. The second and third rows show the REXTAL and genome-wide assemblies, respectively. The bottom part shows the assembly overview, with a highlighted yellow box indicating the region expanded in the top three rows. A. The second row shows the contigs generated by REXTAL, and the third row shows the contigs generated by the genome-wide method for 17p. Four red blocks in one contig are misassembled because of relocations with inconsistency values of 1920, 1172, and 1055. B. The second row shows the contigs generated by REXTAL, and the third row shows the contigs generated by the genome-wide method for 2p. The two red blocks represent a misassembly caused by a 2935 bp gap between two blocks within a contig. Note that the “misassembled contig” in fact contains a gap that corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because the gap exceeds the 1000 bp default when aligned to the reference sequence.
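The gap and inconsistency values quoted in the captions of Figs. 3 to 5 can be tabulated to see how sensitive the "misassembly" calls are to the 1000 bp default. The sketch below is illustrative only: the alternative thresholds and the file names are hypothetical, and the `--extensive-mis-size` flag is, to our knowledge, the QUAST option that controls this threshold, which should be verified against the installed QUAST version. The command is only constructed, not executed.

```python
# Values (bp) copied from the captions of Figs. 3-5; region labels are ours.
REPORTED_BP = {
    "16q_2nd (Fig. 3)": [1512],
    "19q (Fig. 4A)": [1115],
    "17q (Fig. 4B)": [1168],
    "17p (Fig. 5A)": [1920, 1172, 1055],
    "2p (Fig. 5B)": [2935],
}


def still_flagged(threshold_bp: int) -> dict:
    """Return, per region, the reported values that exceed the given threshold."""
    flagged = {}
    for region, values in REPORTED_BP.items():
        above = [v for v in values if v > threshold_bp]
        if above:
            flagged[region] = above
    return flagged


def quast_command(assembly: str, reference: str, out_dir: str, threshold_bp: int) -> list:
    """Build (but do not run) a QUAST command line with a custom relocation threshold.

    Assumption: --extensive-mis-size is the flag for this threshold.
    """
    return ["quast.py", assembly, "-r", reference, "-o", out_dir,
            "--extensive-mis-size", str(threshold_bp)]


if __name__ == "__main__":
    for threshold in (1000, 2000, 3000):  # 2000 and 3000 are hypothetical alternatives
        print(threshold, still_flagged(threshold))
    # Hypothetical file names, for illustration only.
    print(" ".join(quast_command("rextal_17p.fasta", "ref_17p.fasta", "quast_out", 2000)))
```

Every value reported in the captions lies between 1000 and 3000 bp, so a modestly larger threshold would remove most of these calls. Whether that is appropriate depends on the size of the true structural differences one still wants to detect.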

References

    1. Islam T et al., “REXTAL: Regional Extension of Assemblies Using Linked-Reads,” International Symposium on Bioinformatics Research and Applications, pp. 63–78, 2018.
    2. Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB, “Direct determination of diploid genome sequences,” Genome Research, 27, pp. 757–767, 2017.
    3. Zheng GX-L-P et al., “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing,” Nature Biotechnology, 34, pp. 303–311, 2016.
    4. Gurevich A, Saveliev V, Vyahhi N, Tesler G, “QUAST: quality assessment tool for genome assemblies,” Bioinformatics, 29, pp. 1072–1075, 2013.
    5. Barthelson R et al., “Plantagora: modeling whole genome sequencing and assembly of plant genomes,” PLoS One, 6:e28436, 2011.
