Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Jul 19;37(12):1644-1651.
doi: 10.1093/bioinformatics/btab005.

Genome-scale de novo assembly using ALGA

Affiliations

Genome-scale de novo assembly using ALGA

Sylwester Swat et al. Bioinformatics. .

Abstract

Motivation: There are very few methods for de novo genome assembly based on the overlap graph approach. It is considered as giving more exact results than the so-called de Bruijn graph approach but in much greater time and of much higher memory usage. It is not uncommon that assembly methods involving the overlap graph model are not able to successfully compute greater datasets, mainly due to memory limitation of a computer. This was the reason for developing in last decades mainly de Bruijn-based assembly methods, fast and fairly accurate. However, the latter methods can fail for longer or more repetitive genomes, as they decompose reads to shorter fragments and lose a part of information. An efficient assembler for processing big datasets and using the overlap graph model is still looked out.

Results: We propose a new genome-scale de novo assembler based on the overlap graph approach, designed for short-read sequencing data. The method, ALGA, incorporates several new ideas resulting in more exact contigs produced in short time. Among these ideas, we have creation of a sparse but quite informative graph, reduction of the graph including a procedure referring to the problem of minimum spanning tree of a local subgraph, and graph traversal connected with simultaneous analysis of contigs stored so far. What is rare in genome assembly, the algorithm is almost parameter-free, with only one optional parameter to be set by a user. ALGA was compared with nine state-of-the-art assemblers in tests on genome-scale sequencing data obtained from real experiments on six organisms, differing in size, coverage, GC content and repetition rate. ALGA produced best results in the sense of overall quality of genome reconstruction, understood as a good balance between genome coverage, accuracy and length of resulting sequences. The algorithm is one of tools involved in processing data in currently realized national project Genomic Map of Poland.

Availability and implementation: ALGA is available at http://alga.put.poznan.pl.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
An example of the construction of a sparse graph. (A) Three overlapping reads. (B) The algorithm starts with the overlap area set to its minimum value and increases it once overlaps of such a length have been checked within the whole dataset. The first found edge is (a, c), weighted by the appropriate offset. (C) After increasing the overlap area to β+γ, edge (b, c) is found, and it replaces (a, c) recognized at this moment as a transitive edge because of the presence of the supplementing connection from a to b. This connection can be quickly checked with bitwise operations, for the offset between a and b is already known. (D) Edge (a, b) is added once the overlap area rises to α+β
Fig. 2.
Fig. 2.
Solving the minimum directed spanning tree problem in a local subarea. (A) Reads from the example. (B) Subgraph Sa identified in a larger graph for vertex a, here D=250. Sa contains a directed cycle and no edge is deleted in this iteration. (C) Result of the reduction for subgraph Sa identified for vertex a. Vertex h is not reachable from a within distance D, so edges (g, h) and (h, f) do not belong to Sa
Fig. 3.
Fig. 3.
Results shown as the dependency of NG50 values and the length of misassemblies. NG50 means the length of a contig that together with at least such long contigs cover half of a reference genome. Total misassembly length is the sum of lengths of inaccurate contigs (being contigs misassembled, unaligned or partially unaligned to a reference genome). The better the results are, the closer to the right-bottom part of the graph they are visualized. Axes are in logarithmic scale
Fig. 4.
Fig. 4.
Results shown as the part (per cent) of a reference genome covered by contigs aligned to it, depending on the minimal length of contigs taken into account. X axis is in logarithmic scale

References

    1. 1001 Genomes Consortium (2016) 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell, 166, 481–491. - PMC - PubMed
    1. Ameur A. et al. (2019) Single-molecule sequencing: towards clinical applications. Trends Biotechnol., 37, 72–85. - PubMed
    1. Bankevich A. et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 455–477. - PMC - PubMed
    1. Blazewicz J. et al. (2009) Whole genome assembly from 454 sequencing output via modified DNA graph concept. Comput. Biol. Chem., 33, 224–230. - PubMed
    1. Blazewicz J. et al. (2002) A heuristic managing errors for DNA sequencing. Bioinformatics, 18, 652–660. - PubMed