Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2025 Jun 18:2023.11.21.568151.
doi: 10.1101/2023.11.21.568151.

Prokrustean Graph: A substring index for rapid k-mer size analysis

Affiliations

Prokrustean Graph: A substring index for rapid k-mer size analysis

Adam Park et al. bioRxiv. .

Abstract

The widespread adoption of k-mers in bioinformatics has led to efficient methods utilizing genomic sequences in a variety of biological tasks. However, understanding the influence of k-mer sizes within these methods remains a persistent challenge, as the outputs of complex bioinformatics pipelines obscure this influence with various noisy factors. The choice of k-mer size is often arbitrary, with justification frequently omitted in the literature and method tutorials. Furthermore, recent methods employing multiple k-mer sizes encounter significant computational challenges. Nevertheless, most methods are built on well-defined objects related to k-mers, such as de Bruijn graphs, Jaccard similarity, Bray-Curtis dissimilarity, and k-mer spectra. The role of k-mer sizes within these objects is more intuitive and can be described by numerous quantities and metrics. Therefore, exploring these objects across k-mer sizes opens opportunities for robust analyses and new applications. However, the evolution of k-mer objects with respect to k-mer sizes is surprisingly elusive. We introduce a novel substring index, the Prokrustean graph, that elucidates the transformation of k-mer sets across k-mer sizes. Our framework built upon this index rapidly computes k-mer-based quantities for all k-mer sizes, with computational complexity independent of the size range and dependent only on maximal repeats. For example, counting maximal simple paths in de Bruijn graphs for k = 1 , , 100 is achieved in seconds using our index on a gigabase-scale dataset. We present a variety of such experiments relevant to pangenomics and metagenomics. The Prokrustean graph is space-efficiently constructed from the Burrows-Wheeler Transform. Through this construction, it becomes evident that other modern substring indices inherently face difficulties in exploring k-mer objects across sizes, which motivated our data structure. Our implementation is available at: https://github.com/KoslickiLab/prokrustean.

PubMed Disclaimer

Figures

Fig.1:
Fig.1:
Size decay of a Prokrustean graph as a function of kmin. The sizes of Prokrustean graphs, i.e., number of vertices and edges, versus kmin values for a short read dataset ERR3450203 are depicted. The size of the graph rapidly drops around kmin=[11,,19], meaning the maximal repeats and locally-maximal repeat regions are dense when kmin is under 11. The number of edges falls again around kmin=100 because the graph eventually becomes fully disconnected.
Fig.2:
Fig.2:
The number of maximal unitigs in de Bruijn graphs of order k = 20,...,150 of metagenomic short reads (ERR3450203). A complex scenario is revealed: The decrease of the numbers fluctuates in k = 30,...,60, and the peak around k = 90 corresponds to the sudden decrease in maximal repeats. Lastly, further disconnections increase contigs and eventually make the graph completely disconnected.
Fig.3:
Fig.3:
Bray-Curtis dissimilarities between four metagenomic samples computed with Algorithm S2. We used the sequencing data of four samples derived from infant fecal microbiota, which were studied in [36]. Samples include two from 12-month-old human subjects and two from 3-week-olds. The relative order between values shift as k changes, e.g. compared to all other dissimilarities, the dissimilarity between sample 2 and 4 is highest at k = 8 but lowest at k = 21. Note that diverse k sizes (10, 12, 21, 31) are actively utilized in practice.
Fig.4:
Fig.4:
Vertex degrees of the overlap graph of metagenomic short reads (ERR3450203) computed with Algorithm S4. The number of vertices per each degree is scaled as log log y. There is a clear peak around 102 at both incoming and outgoing degrees, which may imply dense existence of abundant species. The intermittent peaks after 102 might be related to repeating regions within and across species.
Fig.5:
Fig.5:
Recursive repeats. At each sequence on the top level, locally-maximal repeat regions (indicated with blue underlining) are used to denote substrings that are more frequent than its superstring. The arrows then points to a node of the substring, and the process can repeat hierarchically.
Fig.6:
Fig.6:
The Prokrustean graph of U={ACCCT,GACCC,TCCCG,GGCG}. The left graph illustrates the recursion of locally-maximal repeat regions, depicted as blue regions, following the same rule described in Figure 5. The Prokrustean graph on the right represents the same structure that white vertices correspond to sequences U and blue vertices represent maximal repeats RU found through the recursion process shown in the left graph. The strings are not stored, so integer labels store the regions and the size of the substrings.
Fig.7:
Fig.7:
Complementary regions. Blue underlining depicts locally maximal regions, and red underlining depicts regions that complement k-mers not covered in blue regions, given k values of 3 and 4. Every k-mer appears exactly once within a red region. For example, in the left image, the 3-mer ACC is included in a red region within ACCCC and does not appear again in any other red region. Note that red regions can include multiple co-occurring k-mers. So, in the right image, both 4-mers TCCC and CCCG co-occur within TCCCG.
Fig.8:
Fig.8:
Three types of maximal co-occurrence stacks. Maximal co-occurrence stacks of S (sets of red rectangles) are introduced at each row for each type, given two locally-maximal repeat regions in S (blue rectangles). The Rate column describes the change in the count of expressed k-mers as the k-mer size increases by 1. Hence, the rate is δl+δr1. Also, δl and δr control the vertical shapes of stacks. For example, the Type 0 stack forms like a rectangle with δl=δr=0, but the Type 1 stack forms like stairs on its left with δl=1.

Similar articles

References

    1. AlEisa H. N., Hamad S., and Elhadad A.. K-mer spectrum-based error correction algorithm for next-generation sequencing data. Computational Intelligence and Neuroscience, 2022, 2022. - PMC - PubMed
    1. Balvert M., Luo X., Hauptfeld E., Schönhuth A., and Dutilh B. E.. Ogre: overlap graph-based metagenomic read clustering. Bioinformatics, 37(7):905–912, 2021. - PMC - PubMed
    1. Bankevich A., Bzikadze A. V., Kolmogorov M., Antipov D., and Pevzner P. A.. Multiplex de bruijn graphs enable genome assembly from long, high-fidelity reads. Nature biotechnology, 40(7):1075–1081, 2022. - PubMed
    1. Belazzougui D.. Linear time construction of compressed text indices in compact space. In Proceedings of the forty-sixth Annual ACM Symposium on Theory of Computing, pages 148–193, 2014.
    1. Belazzougui D. and Cunial F.. Fully-functional bidirectional burrows-wheeler indexes and infinite-order de bruijn graphs. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.

Publication types