Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Sep 24;21(1):253.
doi: 10.1186/s13059-020-02157-2.

GraphAligner: rapid and versatile sequence-to-graph alignment

Affiliations

GraphAligner: rapid and versatile sequence-to-graph alignment

Mikko Rautiainen et al. Genome Biol. .

Abstract

Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools.Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner.

Keywords: Error correction; Genome graphs; Long reads; Pangenome; Sequence alignment.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Fraction of reads correctly aligned at varying read lengths for the variant graph and the linear graph. Left: reads simulated from the GRCh37 reference. Right: reads simulated from de novo assembled contigs of HG00733
Fig. 2
Fig. 2
Overview of the error correction pipeline. The circles represent data and the rectangles programs
Fig. 3
Fig. 3
Overview of GraphAligner. Reads are aligned independently of the other reads. Seed hits are found by matching the sequence of the read to sequences inside nodes (small blue and green bars). Seed hits are clustered in locally acyclic parts of the graph and scored. Seed hits are then extended (small dotted boxes) with a banded dynamic programming algorithm, using Viterbi’s algorithm to decide when to clip the alignment (red X). Each seed hit can result in an alignment (blue and green paths). Alignments that overlap with an another, longer alignment in the query sequence are classified as secondary. Secondary alignments are discarded by default (red X) but can be included in the output with an optional parameter. The output is then written to a file either as alignments or corrected reads
Fig. 4
Fig. 4
Converting a bidirected graph with variable edge overlaps to an alignment graph. Top: a bidirected graph with three nodes. The edges are labeled by their overlap. The red-colored bars represent the same sequence, which should not be duplicated during traversal. Similarly, the orange-colored bars represent the same sequence. Bottom: the alignment graph created from the top graph. The colors of the base pairs show how they match between the two graphs, with each sequence in the original graph represented by the same color in the alignment graph twice, once for the forward strand and once for the reverse complement. Similarly to the bidirected graph, the red and orange bars represent the same sequences. There are two subgraphs, one representing the forward traversal (top) and one the backward traversal (bottom) with reverse complemented node labels. Each edge introduces a breakpoint in the target node, splitting the node at the boundary of the overlap. The alignment graph then connects the ends of the overlap such that the overlapping sequence is only traversed once
Fig. 5
Fig. 5
Seeding. Top: A graph with four nodes. Middle: The node sequences are extracted from the nodes. The arrows represent a mapping between the strings and nodes. Bottom: A read. Highlighted in red: Matches between the read and a string are converted into a match inside a node using the mapping
Fig. 6
Fig. 6
Building a minimizer index from a graph. Only the nodes of the graph are considered when building the index, and edges are ignored. Each node has an ID and a sequence. At the start, all nodes are labeled as unprocessed. Threads pick nodes one at a time from the pool of unprocessed nodes, and find minimizers in the node sequence. Then, the threads distribute the minimizers into buckets according to the modulo of their k-mer. Once all nodes have been processed, the threads proceed to index the buckets. Each thread picks one bucket and indexes it into a bucket index. The bucket index contains an array of the minimizers in that bucket sorted by the k-mer, a bitvector representing indices where a k-mer is different from the previous one, and a minimal perfect hash function which assigns each k-mer to the rank of the bit which represents the first instance of that k-mer in the sorted array
Fig. 7
Fig. 7
A chain of superbubbles containing three superbubbles. The solid circles are nodes connected by directed edges. The dashed circles show the three superbubbles. The three superbubbles start and end at A and B, B and C, and C and D, respectively. The three superbubbles form one chain of superbubbles since they share start and end nodes
Fig. 8
Fig. 8
Left: regular banded alignment with b=3. The reference is on top and the query on the left. The gray cells are inside the band and are calculated. The blue line shows the traceback of the optimal alignment. Right: score based banding with b=1. The reference is on top and the query on the left. The gray cells are inside the band and the blue line is the traceback. The red-circled cells are the minimum for each row, which are discovered during the calculation of the matrix and define whether a cell is inside the band or not; a cell is inside the band if its score is within b of the minimum score in the same row. The cells with a number on a white background are calculated to discover the end of the band, but they are not inside the band and are ignored when calculating the next row. The band can wander around the DP matrix and change size, automatically spreading wider in high error areas and narrower in low error areas. Note that the score-based banding parameter is 1 in comparison to 3 in the regular banding to the left. The implementation uses a coarser band of 64 x 64 blocks instead of individual cells
Fig. 9
Fig. 9
Dynamic score-based banding applied on a graph. Top: an alignment graph. Bottom: The DP matrix for aligning a read to the above graph. The arrows show the correspondence between nodes in the graph and columns in the DP matrix. The dotted lines separate the nodes. The gray area represents parts of the DP matrix which are calculated, and the parts in the white area are not calculated. At each fork, the band spreads to all out-neighbors. The score-based banding implicitly limits the exploration of the alternate paths; as the scores in the alternate paths become worse than the optimal path, the explored part shrinks until finally the exploration stops completely. The blue line shows the backtrace path
Fig. 10
Fig. 10
Sparse storage of the DP matrix. Each node is stored in blocks of 64 rows and up to 64 columns. The scores of the corner cells (solid black) are stored explicitly, using 4 bytes per cell. The border cells (gray) are stored with a score difference, using 2 bits per cell. The middle cells (white) are not stored
Fig. 11
Fig. 11
An example of using Viterbi’s algorithm for estimating correctness per slice. The error rates of the minimum alignment per slice (not shown in figure) are the observations. The numbers represent the probability of the alignment being in the specific state at the specific slice. The arrows represent the predecessor state for each state in each slice. Slice 2 is guaranteed correct since the predecessor for the wrong state in slice 3 is through the correct state. Similarly, slice 4 is guaranteed wrong since the predecessor for the correct state in slice 5 is through the wrong state. None of the other slices are guaranteed correct or wrong. The final backtrace will consider slices 0, 1, and 2 correctly aligned and slices 3 to 5 wrongly aligned, and only the sequence in slices 0–2 will be reported in the alignment

References

    1. Computational Pan-Genomics Consortium Computational pan-genomics: status, promises and challenges. Brief Bioinforma. 2016;19(1):118–35. - PMC - PubMed
    1. Paten B, Novak AM, Eizenga JM, Garrison E. Genome graphs and the evolution of genome inference. Genome Res. 2017;27(5):665–76. - PMC - PubMed
    1. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77. - PMC - PubMed
    1. Antipov D, Korobeynikov A, McLean JS, Pevzner PA. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics. 2015;32(7):1009–15. - PMC - PubMed
    1. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017;13(6):1005595. - PMC - PubMed

Publication types