Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Dec 11;6(4):lqae159.
doi: 10.1093/nargab/lqae159. eCollection 2024 Dec.

GIN-TONIC: non-hierarchical full-text indexing for graph genomes

Affiliations

GIN-TONIC: non-hierarchical full-text indexing for graph genomes

Ünsal Öztürk et al. NAR Genom Bioinform. .

Abstract

This paper presents a new data structure, GIN-TONIC (Graph INdexing Through Optimal Near Interval Compaction), designed to index arbitrary string-labelled directed graphs representing, for instance, pangenomes or transcriptomes. GIN-TONIC provides several capabilities not offered by other graph-indexing methods based on the FM-Index. It is non-hierarchical, handling a graph as a monolithic object; it indexes at nucleotide resolution all possible walks in the graph without the need to explicitly store them; it supports exact substring queries in polynomial time and space for all possible walk roots in the graph, even if there are exponentially many walks corresponding to such roots. Specific ad-hoc optimizations, such as precomputed caches, allow GIN-TONIC to achieve excellent performance for input graphs of various topologies and sizes. Robust scalability capabilities and a querying performance close to that of a linear FM-Index are demonstrated for two real-world applications on the scale of human pangenomes and transcriptomes. Source code and associated benchmarks are available on GitHub.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Example segment of a complex region of a string graph. Vertices are denoted with rectangles, and edges with arrows, and vertex labels as text inside rectangles representing vertices. Walk roots corresponding to the string CATACA—equivalent to vertex index and offsets—are marked with vertical arrows pointing to vertex labels. The walks inducing CATACA starting from the marked walk roots are underlined. Note that one root may induce multiple walks of the example string, that walks stemming from different roots may partially overlap, and that the same path can contain more than one match. Locate queries output all walk roots (vertical arrows), and Walk queries output all paths (segments underlining vertex labels).
Figure 2.
Figure 2.
Illustrated examples of how to build and query a GIN-TONIC. (I) A string graph of five vertices and eight edges. (II) Arbitrarily sampled constraint sets for three string label prefixes A, AT, AG generated according to the procedure in the section ‘The vertex permutation’. (III) Codeword assignment to enforce a permutation preserving a consecutive order of the constraint sets in (II). (IV) The single-vertex range translator over σ0. formula image contains the ranges of formula image in the permutation given in (III). (V) LF-Mapping constructed on the graph encoding with the permutation obtained in (III) with formula image and formula image. The LF-Mapping is partitioned into chunks of 10 characters for visual and algorithm tracing convenience. (VI) IMT based on the single range mapper formula image. (VII) Matching Q = AAT using the data structures in V and VI. The query is successively extended one character at a time starting from its end to the beginning as in the linear FM-Index.
Figure 3.
Figure 3.
Benchmarking GIN-TONIC (fast) on adversarial queries. Left panels are for the pangenome, right panels for the transcriptome. Panels (A) and (B) show baseline performance using neither cache nor permutation, only permutation, only cache; and performance of different GIN-TONIC implementations (fast, native and sdsl-based) using both cache and permutation as characters matched per second. Panels (C) and (D) show average Δ(Q) as a function of cache depth. Caches of depth 0 and 1 have very similar performances close to the baseline since they cannot traverse more than one vertex, and hence they appear superimposed in the plots. Panels (E) and (F) show Δ(Q) versus characters matched per second, grouped by cache depth (dashed lines) and average effective query length (solid lines).

References

    1. Paten B., Novak A. M., Eizenga J. M., Garrison E.. Genome graphs and the evolution of genome inference. Genome Res. 2017; 27:665–676. - PMC - PubMed
    1. Liao W.-W., Asri M., Ebler J., Doerr D., Haukness M., Hickey G., Lu S., Lucas J. K., Monlong J., Abel H. J.et al. .. A draft human pangenome reference. Nature. 2023; 617:312–324. - PMC - PubMed
    1. Wang T., Antonacci-Fulton L., Howe K., Lawson H. A., Lucas J. K., Phillippy A. M., Popejoy A. B., Asri M., Carson C., Chaisson M. J. P.et al. .. The Human Pangenome Project: a global resource to map genomic diversity. Nature. 2022; 604:437–446. - PMC - PubMed
    1. The Human Pangenome Reference Consortium Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 2023; 42:663–673. - PMC - PubMed
    1. Liu B., Liu Y., Li J., Guo H., Zang T., Wang Y.. deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index. Genome Biol. 2019; 20:274. - PMC - PubMed

LinkOut - more resources