Scalable sequence database search using partitioned aggregated Bloom comb trees
- PMID: 37387170
- PMCID: PMC10311332
- DOI: 10.1093/bioinformatics/btad225
Scalable sequence database search using partitioned aggregated Bloom comb trees
Abstract
Motivation: The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making searchable immense public resources accessible is beyond the reach of alignment-based strategies. In recent years, abundant literature tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants while being scalable to collections up to 10 000 eukaryotic samples. Results. Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3-6 fold improvement in construction time compared to other compressed methods for comparable index size. A PAC query can need single random access and be performed in constant time in favorable instances. Using limited computation resources, we built PAC for very large collections. They include 32 000 human RNA-seq samples in 5 days, the entire GenBank bacterial genome collection in a single day for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed that PAC's ability to query 500 000 transcript sequences in less than an hour.
Availability and implementation: PAC's open-source software is available at https://github.com/Malfoy/PAC.
© The Author(s) 2023. Published by Oxford University Press.
Conflict of interest statement
None declared.
Figures






Similar articles
-
A space and time-efficient index for the compacted colored de Bruijn graph.Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292. Bioinformatics. 2018. PMID: 29949982 Free PMC article.
-
Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment.PLoS One. 2021 Aug 3;16(8):e0255260. doi: 10.1371/journal.pone.0255260. eCollection 2021. PLoS One. 2021. PMID: 34343181 Free PMC article.
-
Improved representation of sequence bloom trees.Bioinformatics. 2020 Feb 1;36(3):721-727. doi: 10.1093/bioinformatics/btz662. Bioinformatics. 2020. PMID: 31504157 Free PMC article.
-
Squeakr: an exact and approximate k-mer counting system.Bioinformatics. 2018 Feb 15;34(4):568-575. doi: 10.1093/bioinformatics/btx636. Bioinformatics. 2018. PMID: 29444235
-
Data structures based on k-mers for querying large collections of sequencing data sets.Genome Res. 2021 Jan;31(1):1-12. doi: 10.1101/gr.260604.119. Epub 2020 Dec 16. Genome Res. 2021. PMID: 33328168 Free PMC article. Review.
Cited by
-
The backpack quotient filter: A dynamic and space-efficient data structure for querying k-mers with abundance.iScience. 2024 Nov 23;27(12):111435. doi: 10.1016/j.isci.2024.111435. eCollection 2024 Dec 20. iScience. 2024. PMID: 39720533 Free PMC article.
-
Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.Nat Comput Sci. 2024 Feb;4(2):104-109. doi: 10.1038/s43588-024-00596-6. Epub 2024 Feb 26. Nat Comput Sci. 2024. PMID: 38413777
-
Fractional hitting sets for efficient multiset sketching.Algorithms Mol Biol. 2025 Feb 8;20(1):1. doi: 10.1186/s13015-024-00268-0. Algorithms Mol Biol. 2025. PMID: 39923117 Free PMC article.
-
Kaminari: a resource-frugal index for approximate colored k-mer queries.bioRxiv [Preprint]. 2025 May 21:2025.05.16.654317. doi: 10.1101/2025.05.16.654317. bioRxiv. 2025. PMID: 40475623 Free PMC article. Preprint.
-
A survey of k-mer methods and applications in bioinformatics.Comput Struct Biotechnol J. 2024 May 21;23:2289-2303. doi: 10.1016/j.csbj.2024.05.025. eCollection 2024 Dec. Comput Struct Biotechnol J. 2024. PMID: 38840832 Free PMC article. Review.
References
-
- Altschul SF, Gish W, Miller W. et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10. - PubMed
-
- Belazzougui D, Gagie T, Mäkinen V. et al. Bidirectional variable-order de Bruijn graphs. Int J Found Comput Sci 2018;29:1279–95.
-
- Bingmann T, Bradley P, Gauger F. et al. COBS: a compact bit-sliced signature index. In: SPIRE. 2019. doi: 1905.09624.
Publication types
MeSH terms
LinkOut - more resources
Full Text Sources
Research Materials
Miscellaneous