An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search
- PMID: 32176522
- PMCID: PMC7185321
- DOI: 10.1089/cmb.2019.0322
An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search
Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes-patterns of color occurrence-present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11 × better compression compared to Ramen, Ramen, Rao (RRR).
Keywords: RNA-sequence search; compression schemes; de bruijn graph; proximate membership query.
Conflict of interest statement
The authors declare they have no competing financial interests.
Figures




Similar articles
-
Building large updatable colored de Bruijn graphs via merging.Bioinformatics. 2019 Jul 15;35(14):i51-i60. doi: 10.1093/bioinformatics/btz350. Bioinformatics. 2019. PMID: 31510647 Free PMC article.
-
A space and time-efficient index for the compacted colored de Bruijn graph.Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292. Bioinformatics. 2018. PMID: 29949982 Free PMC article.
-
Simplitigs as an efficient and scalable representation of de Bruijn graphs.Genome Biol. 2021 Apr 6;22(1):96. doi: 10.1186/s13059-021-02297-z. Genome Biol. 2021. PMID: 33823902 Free PMC article.
-
deGSM: Memory Scalable Construction Of Large Scale de Bruijn Graph.IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2157-2166. doi: 10.1109/TCBB.2019.2913932. Epub 2021 Dec 8. IEEE/ACM Trans Comput Biol Bioinform. 2021. PMID: 31056509
-
Applications of de Bruijn graphs in microbiome research.Imeta. 2022 Mar 1;1(1):e4. doi: 10.1002/imt2.4. eCollection 2022 Mar. Imeta. 2022. PMID: 38867733 Free PMC article. Review.
Cited by
-
Fulgor: a fast and compact k-mer index for large-scale matching and color queries.Algorithms Mol Biol. 2024 Jan 22;19(1):3. doi: 10.1186/s13015-024-00251-9. Algorithms Mol Biol. 2024. PMID: 38254124 Free PMC article.
-
Shark: fishing relevant reads in an RNA-Seq sample.Bioinformatics. 2021 May 1;37(4):464-472. doi: 10.1093/bioinformatics/btaa779. Bioinformatics. 2021. PMID: 32926128 Free PMC article.
-
Movi Color: fast and accurate long-read classification with the move structure.bioRxiv [Preprint]. 2025 May 27:2025.05.22.655637. doi: 10.1101/2025.05.22.655637. bioRxiv. 2025. PMID: 40502105 Free PMC article. Preprint.
-
Constructing small genome graphs via string compression.Bioinformatics. 2021 Jul 12;37(Suppl_1):i205-i213. doi: 10.1093/bioinformatics/btab281. Bioinformatics. 2021. PMID: 34252955 Free PMC article.
-
Fast and Scalable Parallel External-Memory Construction of Colored Compacted de Bruijn Graphs with Cuttlefish 3.bioRxiv [Preprint]. 2025 Feb 6:2025.02.02.636161. doi: 10.1101/2025.02.02.636161. bioRxiv. 2025. PMID: 39975203 Free PMC article. Preprint.
References
-
- Alipanahi B., Kuhnle A., and Boucher C.. 2018a. Recoloring the colored de Bruijn graph, 1–11. In International Symposium on String Processing and Information Retrieval. Springer, Cham
-
- Alipanahi B., Muggli M. D., Jundi M., et al. . 2018b. Resistome SNP calling via read colored de Bruijn graphs. bioRxiv, 156174 Technical Report 312, Operation Research, NC state university, Waden, Germany
-
- Almodaresi F., Pandey P., and Patro R.. 2017. Rainbowfish: A succinct colored de Bruijn graph representation. In LIPIcs-Leibniz International Proceedings in Informatics, volume 88 Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. Oxford University Press, Oxford, England
-
- Althaus E., Funke S., Har-Peled S., et al. . 2005. Approximating k-hop minimum-spanning trees. Oper. Res. Lett. 33, 115–120
Publication types
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources
Research Materials