Nat Commun. 2023 Jan 13;14(1):204.
doi: 10.1038/s41467-022-35670-y.

GALA: a computational framework for de novo chromosome-by-chromosome assembly with long reads

Mohamed Awad et al. Nat Commun. 2023.

Abstract

High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows for long-read platforms. Here we report on GALA (Gap-free long-read Assembly tool), a computational framework for chromosome-based sequencing data separation and de novo assembly implemented through a multi-layer graph that identifies discordances within preliminary assemblies and partitions the data into chromosome-scale scaffolding groups. The subsequent independent assembly of each scaffolding group generates a gap-free assembly likely free from the mis-assembly errors which usually hamper existing workflows. This flexible framework also allows us to integrate data from various technologies, such as Hi-C, genetic maps, and even motif analyses to generate gap-free chromosome-scale assemblies. As a proof of principle we de novo assemble the C. elegans genome using combined PacBio and Nanopore sequencing data and a rice cultivar genome using Nanopore sequencing data from publicly available datasets. We also demonstrate the proposed method's applicability with a gap-free assembly of the human genome using PacBio high-fidelity (HiFi) long reads. Thus, our method enables straightforward assembly of genomes with multiple data sources and overcomes barriers that at present restrict the application of de novo genome assembly technology.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of GALA.
After de novo assembly with various tools, preliminary assemblies and raw reads are encoded into a multi-layer computer graph. Mis-assemblies are identified with MDM by browsing through the inter-layer information. The split nodes are clustered into multiple linkage groups by the CCM. Each scaffolding group is assembled independently using SGAM to achieve the final gap-free sequences of chromosomes.
Fig. 2. Illustration of a multi-layer computer graph in GALA.
a The preliminary assemblies and raw reads are aligned against each other and encoded into a multi-layer graph. Conflicted alignments are encoded with edges in red. b The conflicted alignments are removed iteratively by splitting the nodes involved and new edges are assigned accordingly. The procedure stops only after all conflicted alignments in the system have been resolved. c Nodes connected by edges are clustered into scaffolding groups.
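Panel c can be illustrated with a minimal sketch. It assumes that "nodes connected by edges" means connected components of the graph once all conflicted alignments have been resolved; the union-find approach, the function name `scaffolding_groups`, and the node labels are hypothetical and are not taken from the GALA implementation.

```python
from collections import defaultdict


def scaffolding_groups(nodes, edges):
    """Cluster nodes into connected components (assumed reading of panel c).

    nodes: iterable of contig identifiers (hypothetical labels).
    edges: pairs of nodes linked by a non-conflicted alignment.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the component root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


# Hypothetical contigs from three preliminary assemblies.
nodes = ["flye_1", "flye_2", "canu_1", "canu_2", "mini_1"]
edges = [("flye_1", "canu_1"), ("canu_1", "mini_1"), ("flye_2", "canu_2")]
groups = scaffolding_groups(nodes, edges)
```

In this toy input the five contigs fall into two groups, each of which would then be assembled independently.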
Fig. 3. Comparison of Flye assembly with Hi-C scaffolding and GALA assembly of long reads of the C. elegans genome.
a The Flye assembly with Hi-C scaffolding contains numerous gaps and 13 unanchored contigs. b GALA produces a gap-free assembly for each chromosome. Note that this is not a fair comparison, since GALA did not use Hi-C data in this assembly.
Fig. 4. Human genome assembly by GALA.
a Comparison of the number of contigs in assemblies by Canu and GALA. b A cartoon presentation of each chromosome assembled by GALA with the lengths of contigs labelled, created with BioRender.com.
Fig. 5. The assembly performance of GALA changes with the coverage of the sequencing data and with the number and quality of preliminary assemblies.
To investigate the effect of sequencing coverage, three assembly procedures were tested using C. elegans PacBio sequencing data: GALA without Hi-C data, Flye/Hi-C and GALA/Hi-C. Assembly performance is evaluated in terms of a the number of scaffolds, b N50, and c the number of big gaps (>16 Kbp) and mis-assemblies. In (c), only the numbers of gaps and mis-assemblies for Flye/Hi-C are shown for simplicity, as GALA shows an overwhelming advantage: only one mis-assembly was identified in the assembly by GALA without Hi-C data at 30× sequencing coverage. In (d), the number of scaffolding groups obtained by GALA changes significantly with the number and the quality of preliminary assemblies used in the Oryza sativa circum-basmati landrace Dom Sufid genome assembly with the Nanopore sequencing data. Here n = 35, 35, 21, 7 and 1 for the exhaustive strategy, and n = 15, 20, 15, 6 and 1 for the selective strategy, respectively. The median (line), 1st and 3rd quartiles (bounds of the box), and minimum and maximum (whiskers) are shown in the box plot.
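For readers unfamiliar with the N50 metric in panel b, the sketch below uses the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Add sequences from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Illustrative lengths (arbitrary units): total is 290, half is 145.
# 80 + 70 = 150 reaches the halfway point, so N50 is 70.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

A higher N50 generally indicates a more contiguous assembly, but the metric says nothing about correctness, which is why the figure reports gaps and mis-assemblies separately.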
Fig. 6. Comparison of the overlap graphs used by Miniasm during assembly of a region in the C. elegans genome with and without the chromosome-by-chromosome strategy.
a In the whole genome assembly mode, the overlap graph used by Miniasm contains numerous edges and extra effort is needed to collapse edges. b The chromosome-by-chromosome assembly allows a linear overlap graph to be derived by Miniasm in the same region.

