Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Jun 28;40(Suppl 1):i48-i57.
doi: 10.1093/bioinformatics/btae217.

Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets

Affiliations

Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets

Igor Martayan et al. Bioinformatics. .

Abstract

Summary: In this article, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano's scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its different combined abilities position it as a flexible Swiss knife structure for k-mer set management.

Availability and implementation: https://github.com/imartayan/CBL.

PubMed Disclaimer

Conflict of interest statement

None declared.

Figures

Figure 1.
Figure 1.
Example of conversion into necklace for two k-mers of size 3. This operation is used for insertion and query to the structure. We show how k-mers with an odd length are encoded into 2k1 bits. We use a parity definition as the parity of the population count (number of 1’s) of the binary encoding of k-mers to define canonical k-mers. Thus, provided bases and their complement have different parity (e.g. A00, C01, G11, T10) and since k-mers have an odd length, a k-mer and its reverse complement have a different parity each. By choosing one parity as canonical, it is therefore possible to spare a bit for their encoding, i.e. to use 2k1 bits instead of 2k. With a regular encoding, the k-mers’ necklaces are different, contrary to our example. ATA’s necklace would have been 001100 000011 and TTG’s necklace would have been 111110 0111111 (note that necklaces would have been different even if using reverse complements). In our case, we first convert k-mers to their canonical versions (here, the canonical version has an odd number of 1’s), the last bit is then safely removed. The necklaces are computed on these 2k-1 bits, and their starting positions in k-mers are remembered. Necklaces and positions are stored in the main structure.
Figure 2.
Figure 2.
K-mer information supported in each transform-based data-structure. This information can be stored as the de Bruijn graph of order k1 by using, e.g. the extended Burrows–Wheeler Transform (EBWT). Indeed, the arcs in the dBG represent the set of successive k1 mers, which is the set of k mers. This structure stores successively the different labels of the ingoing arcs by lexicographical order of the nodes, which correspond to the k1 mers. Another way is to use the SBWT structures, which use a k-mer to represent all the arcs ingoing in this, or similar, k-mers (all k-mers with the same prefix of size k1) in the previous structure. Even if we lose some information (false positive arcs and thus false positive k-mers), we can recover all k1-mers with this structure.
Figure 3.
Figure 3.
Schematic view of CBL’s data structure. For the sake of simplicity, we show an example where the necklaces are lexicographic and not binary. On the left we show the entire necklace vector filled with input k-mers’ necklaces. Two overlapping k-mers are pictured sharing very locally close necklaces. We depicted how our structure quotients necklaces, stores prefixes in a bit vector, and associates them to suffixes.
Figure 4.
Figure 4.
Time and RAM used when constructing various indexes on growing bacterial genomes collection from Refseq for k=31 and p=28 bits.
Figure 5.
Figure 5.
Time/memory trade-off of various tools when performing streaming queries of present k-mers (up) and absent k-mers (down) for k=31 and p=28 bits, each point is averaged over a series of experiments.
Figure 6.
Figure 6.
Time/memory trade-off of various tools when performing k-mer insertions (up) and deletions (down) for k=31 and p=28 bits, each point is averaged over a series of experiments.
Figure 7.
Figure 7.
Time/memory trade-off of various tools when performing set intersections for k=31 and p=28 bits.
Figure 8.
Figure 8.
Time/memory trade-off of various tools when performing set union for k=31 and p=28 bits.
Figure 9.
Figure 9.
Time and RAM used when constructing various indexes on growing amounts of chromosomes from the T2T human genome for k=31 and p=28 bits.
Figure 10.
Figure 10.
Time and RAM used when constructing various indexes on a ONT long read dataset for k=31 and p=28 bits.

References

    1. Agret C, Chateau A, Droc G. et al. RedOak: A reference-free and alignment-free structurefor indexing a collection of similar genomes. JOSS 2022;7:4363.
    1. Alanko J, Alipanahi B, Settle J. et al. Buffering updates enables efficient dynamic de Bruijn graphs. Comput Struct Biotechnol J 2021;19:4067–78. - PMC - PubMed
    1. Alanko JN, Puglisi SJ, Vuohtoniemi J.. Small searchable κ-spectra via subset rank queries on the spectral Burrows–Wheeler transform. In: SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23). SIAM, 2023a, 225–36.
    1. Alanko JN, Vuohtoniemi J, Mäklin T. et al. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 2023b;39:i260–9. 10.1093/bioinformatics/btad233. - DOI - PMC - PubMed
    1. Alipanahi B, Kuhnle A, Puglisi SJ. et al. Succinct dynamic de Bruijn graphs. Bioinformatics 2021;37:1946–52. - PMC - PubMed

Publication types

LinkOut - more resources