GIN-TONIC: non-hierarchical full-text indexing for graph genomes

Ünsal Öztürk¹, Marco Mattavelli¹, Paolo Ribeca^{2

3

4

5}

Affiliations

¹ SCI-STI-MM, EPFL, ELB 118, Station 11, 1015, Lausanne, Switzerland.
² Biomathematics and Statistics Scotland, The James Hutton Institute, Peter Guthrie Tait Road, EH9 3FD, Edinburgh, United Kingdom.
³ Clinical and Emerging Infection, UK Health Security Agency, 61 Colindale Avenue, NW9 5EQ, London, United Kingdom.
⁴ NIHR Health Protection Research Unit in Genomics and Enabling Data, University of Warwick, Gibbet Hill Road, CV4 7AL, Coventry, United Kingdom.
⁵ NIHR Health Protection Research Unit in Gastrointestinal Infections, University of Liverpool, 8 West Derby Street, L69 7BE, Liverpool, United Kingdom.

PMID: 39664816
PMCID: PMC11632618
DOI: 10.1093/nargab/lqae159

GIN-TONIC: non-hierarchical full-text indexing for graph genomes

Ünsal Öztürk et al. NAR Genom Bioinform. 2024.

. 2024 Dec 11;6(4):lqae159.

doi: 10.1093/nargab/lqae159. eCollection 2024 Dec.

Authors

Ünsal Öztürk¹, Marco Mattavelli¹, Paolo Ribeca^{2

3

4

5}

Affiliations

¹ SCI-STI-MM, EPFL, ELB 118, Station 11, 1015, Lausanne, Switzerland.
² Biomathematics and Statistics Scotland, The James Hutton Institute, Peter Guthrie Tait Road, EH9 3FD, Edinburgh, United Kingdom.
³ Clinical and Emerging Infection, UK Health Security Agency, 61 Colindale Avenue, NW9 5EQ, London, United Kingdom.
⁴ NIHR Health Protection Research Unit in Genomics and Enabling Data, University of Warwick, Gibbet Hill Road, CV4 7AL, Coventry, United Kingdom.
⁵ NIHR Health Protection Research Unit in Gastrointestinal Infections, University of Liverpool, 8 West Derby Street, L69 7BE, Liverpool, United Kingdom.

PMID: 39664816
PMCID: PMC11632618
DOI: 10.1093/nargab/lqae159

Abstract

This paper presents a new data structure, GIN-TONIC (Graph INdexing Through Optimal Near Interval Compaction), designed to index arbitrary string-labelled directed graphs representing, for instance, pangenomes or transcriptomes. GIN-TONIC provides several capabilities not offered by other graph-indexing methods based on the FM-Index. It is non-hierarchical, handling a graph as a monolithic object; it indexes at nucleotide resolution all possible walks in the graph without the need to explicitly store them; it supports exact substring queries in polynomial time and space for all possible walk roots in the graph, even if there are exponentially many walks corresponding to such roots. Specific ad-hoc optimizations, such as precomputed caches, allow GIN-TONIC to achieve excellent performance for input graphs of various topologies and sizes. Robust scalability capabilities and a querying performance close to that of a linear FM-Index are demonstrated for two real-world applications on the scale of human pangenomes and transcriptomes. Source code and associated benchmarks are available on GitHub.

PubMed Disclaimer

Figures

**Figure 1.**
Example segment of a complex region of a string graph. Vertices are denoted with rectangles, and edges with arrows, and vertex labels as text inside rectangles representing vertices. Walk roots corresponding to the string CATACA—equivalent to vertex index and offsets—are marked with vertical arrows pointing to vertex labels. The walks inducing CATACA starting from the marked walk roots are underlined. Note that one root may induce multiple walks of the example string, that walks stemming from different roots may partially overlap, and that the same path can contain more than one match. Locate queries output all walk roots (vertical arrows), and Walk queries output all paths (segments underlining vertex labels).

**Figure 2.**
Illustrated examples of how to build and query a GIN-TONIC. (I) A string graph of five vertices and eight edges. (II) Arbitrarily sampled constraint sets for three string label prefixes A, AT, AG generated according to the procedure in the section ‘The vertex permutation’. (**III**) Codeword assignment to enforce a permutation preserving a consecutive order of the constraint sets in (II). (IV) The single-vertex range translator over σ₀. contains the ranges of in the permutation given in (III). (V) LF-Mapping constructed on the graph encoding with the permutation obtained in (III) with and . The LF-Mapping is partitioned into chunks of 10 characters for visual and algorithm tracing convenience. (VI) IMT based on the single range mapper . (**VII**) Matching Q = *AAT* using the data structures in V and VI. The query is successively extended one character at a time starting from its end to the beginning as in the linear FM-Index.

formula image — **Figure 2.**
Illustrated examples of how to build and query a GIN-TONIC. (I) A string graph of five vertices and eight edges. (II) Arbitrarily sampled constraint sets for three string label prefixes A, AT, AG generated according to the procedure in the section ‘The vertex permutation’. (**III**) Codeword assignment to enforce a permutation preserving a consecutive order of the constraint sets in (II). (IV) The single-vertex range translator over σ₀. contains the ranges of in the permutation given in (III). (V) LF-Mapping constructed on the graph encoding with the permutation obtained in (III) with and . The LF-Mapping is partitioned into chunks of 10 characters for visual and algorithm tracing convenience. (VI) IMT based on the single range mapper . (**VII**) Matching Q = *AAT* using the data structures in V and VI. The query is successively extended one character at a time starting from its end to the beginning as in the linear FM-Index.

**Figure 3.**
Benchmarking GIN-TONIC (fast) on adversarial queries. Left panels are for the pangenome, right panels for the transcriptome. Panels (A) and (B) show baseline performance using neither cache nor permutation, only permutation, only cache; and performance of different GIN-TONIC implementations (fast, native and sdsl-based) using both cache and permutation as characters matched per second. Panels (C) and (D) show average Δ(Q) as a function of cache depth. Caches of depth 0 and 1 have very similar performances close to the baseline since they cannot traverse more than one vertex, and hence they appear superimposed in the plots. Panels (E) and (F) show Δ(Q) versus characters matched per second, grouped by cache depth (dashed lines) and average effective query length (solid lines).

See this image and copyright information in PMC

References

1. Paten B., Novak A. M., Eizenga J. M., Garrison E.. Genome graphs and the evolution of genome inference. Genome Res. 2017; 27:665–676. - PMC - PubMed
1. Liao W.-W., Asri M., Ebler J., Doerr D., Haukness M., Hickey G., Lu S., Lucas J. K., Monlong J., Abel H. J.et al. .. A draft human pangenome reference. Nature. 2023; 617:312–324. - PMC - PubMed
1. Wang T., Antonacci-Fulton L., Howe K., Lawson H. A., Lucas J. K., Phillippy A. M., Popejoy A. B., Asri M., Carson C., Chaisson M. J. P.et al. .. The Human Pangenome Project: a global resource to map genomic diversity. Nature. 2022; 604:437–446. - PMC - PubMed
1. The Human Pangenome Reference Consortium Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 2023; 42:663–673. - PMC - PubMed
1. Liu B., Liu Y., Li J., Guo H., Zang T., Wang Y.. deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index. Genome Biol. 2019; 20:274. - PMC - PubMed

LinkOut - more resources

Full Text Sources
- PubMed Central
- Silverchair Information Systems

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

GIN-TONIC: non-hierarchical full-text indexing for graph genomes

Affiliations

GIN-TONIC: non-hierarchical full-text indexing for graph genomes

Authors

Affiliations

Abstract

Figures

References

LinkOut - more resources

Full Text Sources