A space and time-efficient index for the compacted colored de Bruijn graph

Fatemeh Almodaresi¹, Hirak Sarkar¹, Avi Srivastava¹, Rob Patro¹

Affiliations

PMID: 29949982
PMCID: PMC6022659
DOI: 10.1093/bioinformatics/bty292

A space and time-efficient index for the compacted colored de Bruijn graph

Fatemeh Almodaresi et al. Bioinformatics. 2018.

. 2018 Jul 1;34(13):i169-i177.

doi: 10.1093/bioinformatics/bty292.

Authors

Fatemeh Almodaresi¹, Hirak Sarkar¹, Avi Srivastava¹, Rob Patro¹

Affiliation

¹ Department of Computer Science, Stony Brook University, Stonybrook, NY, USA.

PMID: 29949982
PMCID: PMC6022659
DOI: 10.1093/bioinformatics/bty292

Abstract

Motivation: Indexing reference sequences for search-both individual genomes and collections of genomes-is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.

Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.

Availability and implementation: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

**Fig. 1.**
An illustration of searching for a particular k-mer, x, in the *dense* pufferfish index. The minimum perfect hash yields the index, $p_{h (\hat{x})}$ in the pos vector where the k-mer appears in the unipath array. The k-mer is validated against the sequence recorded at this position in useq (and, in this case, it matches). A rank operation on $p_{h (\hat{x})}$ is performed in the bv, which yields the corresponding unipath-level information in the utab. If desired, the relative position of the k-mer within the unipath can be retrieved with an extra select and rank operation. Likewise, the rank used to determine this unipath’s utab entry can also be used to look up the edges adjacent to this unipath in the etab table if desired

**Fig. 2.**
Full taxonomy classification evaluation for three tools of Kraken, Clark and Pufferfish. (a–c) We compare the F-1, spearman correlation and mean absolute relative difference metrics for the results of the three tools over the 10 simulated read datasets of LC1-8 and HC1, 2 without using any filtering options. In the plots in the second row, we evaluate accuracy of reports after running each tool with their default filtering option (which filters out any mapping with <20% k-mer coverage for Kraken, 44 nucleotide coverage for Pufferfish and without a ‘high-confidence’ for Clark.)

See this image and copyright information in PMC

References

1. Almodaresi F. et al. (2017) Rainbowfish: a succinct colored de Bruijn graph representation. In: Schwartz, R. and Reinert, K. (eds) 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), volume 88 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 18: 1–18: 15, Dagstuhl, Germany. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
1. Belazzougui D. et al. (2016) Fully dynamic de Bruijn graphs In International Symposium on String Processing and Information Retrieval, Springer, pp. 145–152.
1. Beller T., Ohlebusch E. (2016) A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms Mol. Biol., 11, 20. - PMC - PubMed
1. Bowe A. et al. (2012) Succinct de Bruijn graphs. In Proceedings of the International Workshop on Algorithms in Bioinformatics, Springer, pp. 225–235.
1. Bray N.L. et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol., 34, 525–527. - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations
Research Materials
- NCI CPTC Antibody Characterization Program
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

A space and time-efficient index for the compacted colored de Bruijn graph

Affiliation

A space and time-efficient index for the compacted colored de Bruijn graph

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources

Research Materials

Miscellaneous