Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2023 Jun 30;39(39 Suppl 1):i252-i259.
doi: 10.1093/bioinformatics/btad225.

Scalable sequence database search using partitioned aggregated Bloom comb trees

Affiliations

Scalable sequence database search using partitioned aggregated Bloom comb trees

Camille Marchet et al. Bioinformatics. .

Abstract

Motivation: The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making searchable immense public resources accessible is beyond the reach of alignment-based strategies. In recent years, abundant literature tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants while being scalable to collections up to 10 000 eukaryotic samples. Results. Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3-6 fold improvement in construction time compared to other compressed methods for comparable index size. A PAC query can need single random access and be performed in constant time in favorable instances. Using limited computation resources, we built PAC for very large collections. They include 32 000 human RNA-seq samples in 5 days, the entire GenBank bacterial genome collection in a single day for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed that PAC's ability to query 500 000 transcript sequences in less than an hour.

Availability and implementation: PAC's open-source software is available at https://github.com/Malfoy/PAC.

PubMed Disclaimer

Conflict of interest statement

None declared.

Figures

Figure 1.
Figure 1.
Left: A SBT structure. Right: A Bloom left-comb tree for the same database. Four datasets of a database are represented using grey shapes. Toy example Bloom filters are represented as bit vectors associated with shapes, and the content of each node is recalled. In the case of the Bloom comb tree, the inner nodes’ Bloom filters (highlighted zone in grey) are not explicitly represented but are encoded using an Aggregated Bloom filter instead (integer vector). Runs of 1’s corresponding to the integers are colored. For instance, the leftmost 1 is found at Levels 1 and 2, therefore a run of length two in the vector.
Figure 2.
Figure 2.
(A) Two Bloom comb trees and their Aggregated Bloom filters. The leaves Bloom filters are in the middle, colored in grey. They are the Bloom filters representing the input datasets. The left-comb tree (top) and right-comb tree (bottom) are built by aggregating the Bloom filter’s k-mers. The Aggregated Bloom filter Vl represents the leftmost path of the top comb. Respectively, Vr represents the rightmost path of the bottom comb. We colored in green the second bit of leaves in order to show its encoding in the Aggregated Bloom filters. In the left comb, this bit is set to 1 only in the root for the longest branch; therefore, the value is 1 (green) in Vl. Conversely, the longest branch of the right tree has a run of five 1’s on the second position, hence, a 5 in Vr. (B) Using Vl and Vr from (A), we show how Aggregated Bloom filters define restrained search spaces in the leaves. Again, the leaves’ Bloom filters are grey, and the values of the vectors are printed vertically. Arrows represent the intervals given by each V, and the final search space at each position is the dotted grey area.
Figure 3.
Figure 3.
(A) The content of one partition in PAC. The Aggregated Bloom comb trees constructed for a given partition are represented using Bloom filter leaves (in gray) and two Aggregated Bloom filters Vl and Vr. (B) A single query in the same partition as (A). The index has been inverted. A k-mer enters a PAC by finding the partition corresponding to its minimizer. Vl and Vr are loaded, and a search space (here [3] only) is defined given the position obtained by hashing the k-mer (here the k-mer’s hash value is 2). Then, a slice is extracted from the inverted index at the given position, and bits are checked to find 1’s. Here, there is a single position, so necessarily an 1.
Figure 4.
Figure 4.
Results on increasing human dataset sizes. We report total CPU hours for constructing the different indexes, including the Bloom filter construction and index constructions. X-axis is in log2 scale and Y-axis is in log10 scale. All methods were run with 12 threads.
Figure 5.
Figure 5.
Results on increasing human dataset sizes. We report disk footprints, both temporary and final, for constructing the different indexes. Note that PAC does not use temporary disk. X-axis is in log2 scale and Y-axis is in log10 scale. All methods were run with 12 threads.
Figure 6.
Figure 6.
Results on query batches. We present query CPU times computed on batches of 1 to 10k transcript sequences. The Y-axis is on a log10 scale. The queries were computed using 12 threads.

Similar articles

Cited by

References

    1. Alipanahi B, Kuhnle A, Puglisi SJ. et al. Succinct dynamic de Bruijn graphs. Bioinformatics 2021;37:1946–52. - PMC - PubMed
    1. Almodaresi F, Sarkar H, Srivastava A. et al. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 2018;34:i169–77. - PMC - PubMed
    1. Altschul SF, Gish W, Miller W. et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10. - PubMed
    1. Belazzougui D, Gagie T, Mäkinen V. et al. Bidirectional variable-order de Bruijn graphs. Int J Found Comput Sci 2018;29:1279–95.
    1. Bingmann T, Bradley P, Gauger F. et al. COBS: a compact bit-sliced signature index. In: SPIRE. 2019. doi: 1905.09624.

Publication types