Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Apr;27(4):485-499.
doi: 10.1089/cmb.2019.0322. Epub 2020 Mar 16.

An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search

Affiliations

An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search

Fatemeh Almodaresi et al. J Comput Biol. 2020 Apr.

Abstract

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes-patterns of color occurrence-present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11 × better compression compared to Ramen, Ramen, Rao (RRR).

Keywords: RNA-sequence search; compression schemes; de bruijn graph; proximate membership query.

PubMed Disclaimer

Conflict of interest statement

The authors declare they have no competing financial interests.

Figures

FIG. 1.
FIG. 1.
Encoding color classes by finding the MST of the color class graph, an undirected graph derived from cdbg. The order of the process is (a–c). The arrows in (a, b) show the direction of edges in the dbg which is a directed graph. The optimal achievable MST is shown in (d) for comparison. Since we never observe the edge between any k-mers from color classes green and yellow in cdbg, we won't have the edge between color classes green and yellow, and therefore, our final MST is not equal to the best MST we can get from a complete color class graph. cdbg, colored de Bruijn graph; MST, minimum spanning tree.
FIG. 1.
FIG. 1.
Encoding color classes by finding the MST of the color class graph, an undirected graph derived from cdbg. The order of the process is (a–c). The arrows in (a, b) show the direction of edges in the dbg which is a directed graph. The optimal achievable MST is shown in (d) for comparison. Since we never observe the edge between any k-mers from color classes green and yellow in cdbg, we won't have the edge between color classes green and yellow, and therefore, our final MST is not equal to the best MST we can get from a complete color class graph. cdbg, colored de Bruijn graph; MST, minimum spanning tree.
FIG. 2.
FIG. 2.
The conceptual MST (top-left); the data structure to store the color information in the format of an MST (right). This figure also illustrates the steps required to build one of the color vectors (C3) at the leaf of the tree. Note that the query process shown here does not depict the caching policy we apply in practice.
FIG. 3.
FIG. 3.
Size of the MST-based color-class representation versus the RRR-based color-class representation. (a) Sizes of the RRR and MST-based color class representations with respect to the number of samples indexed from the human bulk RNA-seq data set. The counting quotient filter component is the Mantis representation of the de Bruijn graph. (b) Empirical asymptotic analysis of the growth rates of the sizes of RRR-based color class representation and the MST-based color class representation. The RRR-based representation grows at a rate of ≈Θ(n1.5), where n is the number of samples. The MST-based representation grows at a rate of ≈Θ(n1.2). RRR, Ramen, Ramen, Rao.

Similar articles

Cited by

References

    1. Alipanahi B., Kuhnle A., and Boucher C.. 2018a. Recoloring the colored de Bruijn graph, 1–11. In International Symposium on String Processing and Information Retrieval. Springer, Cham
    1. Alipanahi B., Muggli M. D., Jundi M., et al. . 2018b. Resistome SNP calling via read colored de Bruijn graphs. bioRxiv, 156174 Technical Report 312, Operation Research, NC state university, Waden, Germany
    1. Almodaresi F., Pandey P., and Patro R.. 2017. Rainbowfish: A succinct colored de Bruijn graph representation. In LIPIcs-Leibniz International Proceedings in Informatics, volume 88 Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. Oxford University Press, Oxford, England
    1. Almodaresi F., Sarkar H., Srivastava A., et al. . 2018. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34, i169–i177 - PMC - PubMed
    1. Althaus E., Funke S., Har-Peled S., et al. . 2005. Approximating k-hop minimum-spanning trees. Oper. Res. Lett. 33, 115–120

Publication types