Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Jul 1;34(13):i169-i177.
doi: 10.1093/bioinformatics/bty292.

A space and time-efficient index for the compacted colored de Bruijn graph

Affiliations

A space and time-efficient index for the compacted colored de Bruijn graph

Fatemeh Almodaresi et al. Bioinformatics. .

Abstract

Motivation: Indexing reference sequences for search-both individual genomes and collections of genomes-is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.

Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.

Availability and implementation: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
An illustration of searching for a particular k-mer, x, in the dense pufferfish index. The minimum perfect hash yields the index, ph(x^) in the pos vector where the k-mer appears in the unipath array. The k-mer is validated against the sequence recorded at this position in useq (and, in this case, it matches). A rank operation on ph(x^) is performed in the bv, which yields the corresponding unipath-level information in the utab. If desired, the relative position of the k-mer within the unipath can be retrieved with an extra select and rank operation. Likewise, the rank used to determine this unipath’s utab entry can also be used to look up the edges adjacent to this unipath in the etab table if desired
Fig. 2.
Fig. 2.
Full taxonomy classification evaluation for three tools of Kraken, Clark and Pufferfish. (ac) We compare the F-1, spearman correlation and mean absolute relative difference metrics for the results of the three tools over the 10 simulated read datasets of LC1-8 and HC1, 2 without using any filtering options. In the plots in the second row, we evaluate accuracy of reports after running each tool with their default filtering option (which filters out any mapping with <20% k-mer coverage for Kraken, 44 nucleotide coverage for Pufferfish and without a ‘high-confidence’ for Clark.)

References

    1. Almodaresi F. et al. (2017) Rainbowfish: a succinct colored de Bruijn graph representation. In: Schwartz, R. and Reinert, K. (eds) 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), volume 88 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 18: 1–18: 15, Dagstuhl, Germany. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
    1. Belazzougui D. et al. (2016) Fully dynamic de Bruijn graphs In International Symposium on String Processing and Information Retrieval, Springer, pp. 145–152.
    1. Beller T., Ohlebusch E. (2016) A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms Mol. Biol., 11, 20. - PMC - PubMed
    1. Bowe A. et al. (2012) Succinct de Bruijn graphs. In Proceedings of the International Workshop on Algorithms in Bioinformatics, Springer, pp. 225–235.
    1. Bray N.L. et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol., 34, 525–527. - PubMed

Publication types