Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Jul 13;20(1):13.
doi: 10.1186/s13015-025-00286-6.

Swiftly identifying strongly unique k-mers

Affiliations

Swiftly identifying strongly unique k-mers

Jens Zentgraf et al. Algorithms Mol Biol. .

Abstract

Motivation: Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position, in a single species from a larger set of species, etc.) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: A single basepair change may turn a unique k-mer into a different one that may in fact be present at one or more different locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept are strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in all of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers with a Hamming-distance-1 neighbor in the collection from those that do not.

Results: We present engineered algorithms to identify and mark within a set K of (canonical) k-mers all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., taking reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billions of k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome.

Availability: An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers .

Keywords: k-mer; Algorithm engineering; Hamming distance; Parallelization; Strong uniqueness.

PubMed Disclaimer

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1
Recursive 4-way comparison at different depths d=1,2,3, from left to right. At depth d, the first d-1 characters of all k-mers within the considered interval are equal, and the d-th character is compared. Let b:=k-d. The (at most) four b-mer suffixes of the k-mers pointed to by pointers pc, cΣ are compared, and the minimal b-mer(s) are identified among the four elements. The k-mers with equal minimal b-mers are marked as weak if there are at least two equal minimal b-mers. The pointers of minimal b-mers are incremented (or inactivated, if their bucket has been completely scanned). After the 4-way comparison of an interval is completed, it is repeated recursively with increased depth d+1 on its 4 sub-intervals (only the first such recursive call for the A-subinterval is shown)
Fig. 2
Fig. 2
Weak k-mer identification by 4-way comparison at depth d=6. The first d-1=5 characters are identical. Pointers pc,c{A,C,G,T}, point at k-mers whose d-th character is c. Considering the b:=(k-d)-suffixes, we identify the smallest suffix among the pointed-to k-mers. Left panel: The smallest suffix is TATAA at pC, as indicated by the bit vector (0010). As a single b-suffix is minimal, no weak k-mers are identified, and pC is increased. Middle panel: The smallest b-suffix is TATCA at pA and pG. Therefore, the two k-mers pointed to by pA and pG are identical except for their d-th characters, and we have found a weak pair. Both pA and pG are incremented. Right panel: The smallest b-suffix is TATCG at pA only. No weak k-mers are identified, and pA is incremented
Fig. 3
Fig. 3
All k-mers are split into blocks based on the :=k/2-prefix and further separated into sections based on third quarter s1 and last quarter s2. For odd k, the middle position m must be included in the third quarter s1. We either have |s1|=|s2| or take care that |s1|=|ss|+1, making the third quarter the longer one if their lengths differ by 1. For example, 25=12+7+6 with k=25, =12, |s1|=7 and |s2|=6
Fig. 4
Fig. 4
After checking if the k-mers differ in a single position in the last quarter s2, existence of a single difference in the third quarter s1 has to be checked. For this, the nucleotides of s1 and s2 are swapped. Then, k-mers are locally re-sorted (within buckets of common =k/2-prefixes), and the algorithm for the last quarter is applied again
Fig. 5
Fig. 5
Comparison of the wall times of FourWay, FourWay+Pairwise and Quarter. Left: Wall times for different values of k using 8 threads on the t2t human reference genome with varying numbers of k-mers depending on k (cf. Fig. 8). Shown is the mean (line) over 3 repeated runs (dots). For k=25 and 8 threads, the naive neighborhood generation approach needs 3937 s (65 min), while all presented algorithms need less than 140 s, the best ones around 30 s (add 300 s for pre- and post-processing). Right: Wall times for different values of k using 8 threads on two artificial datasets with 2 billion k-mers each, in which all k-mers are either strong (solid lines) or weak (dashed lines)
Fig. 6
Fig. 6
Wall times (using 8 threads) of the FourWay+Pairwise algorithm on the t2t human reference genome with different k-mers sizes and thresholds at which we switch to the pairwise comparison
Fig. 7
Fig. 7
Comparison of the wall time of FourWay, FourWay+Pairwise and Quarter with different parallelization strategies. Left: Speedup as a function of the number of threads used for parallelization, for fixed k=25. Right: Wall times for different prefix lengths g (x-axis) and number of threads (color). Each choice of gk/2 divides the k-mers into 4g chunks that can be processed independently in parallel. With g=0, the whole array is processed as one chunk and only one thread is used. If 4g is less than the number of threads, only 4g threads can be used
Fig. 8
Fig. 8
Left: Absolute number of strongly unique, weakly unique and non-unique (multi) k-mers in the t2t reference genome for different values of k (stacked bar chart). The total number of k-mers rapidly increases for k between 13 and 19. For k21, there are only a few more new k-mers with increasing k. Right: Percentage of strongly unique k-mers among the distinct k-mers. For k>25, almost 90% of the distinct k-mers are strong
Fig. 9
Fig. 9
Comparison of 25-mers in intergenic regions, coding genes and exons. A 25-mer is part of a gene/exon if it is completely contained in the corresponding annotated interval. Top plot: relative fractions. Bottom plot: absolute number of positions
Fig. 10
Fig. 10
Distribution of strongly unique, weakly unique and non-unique k-mers in the t2t reference genome. We split the genome into blocks of 100 000 base pairs. For each block, we computed the relative fraction of strongly unique (green), weakly unique (blue) and non-unique (orange) k-mers that start in it. Each block is represented by a small vertical stroke; the relative fractions are encoded by the transparency from 0.0 (invisible) to 1.0 (solid color). Regions with repetitive DNA become visible, in particular the centromeres, but also other repeats
Fig. 11
Fig. 11
25-mer frequency histogram (orange right y-axis; log-scale) and fraction of strong 25-mers (green left y-axis; linear scale) for two datasets. Left panel: An Ossabaw pig with a high average coverage of 80×. A high number of the distinct 25-mers in this sample is unique. Of these, 37% are strongly unique and 63% are weakly unique. Right panel: The son of the GIAB Ashkenazi Trio (HG002). The average coverage is 25×. Again, a high number of the distinct k-mers in this sample is unique. Of these, 39% are strongly unique and 61% are weakly unique

References

    1. Rizzi R, Beretta S, Patterson M, Pirola Y, Previtali M, Vedova GD, Bonizzoni P. Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. Quant Biol. 2019;7(4):278–92. 10.1007/S40484-019-0181-X.
    1. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525–7. - PubMed
    1. Zentgraf J, Rahmann S. Fast lightweight accurate xenograft sorting. Algorithms Mol Biol. 2021;16(1):2. 10.1186/S13015-021-00181-W. - PMC - PubMed
    1. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018;19(1):198. - PMC - PubMed
    1. Hirsch P, Molano L-AG, Engel A, Zentgraf J, Rahmann S, Hannig M, Müller R, Kern F, Keller A, Schmartz GP. Mibianto: ultra-efficient online microbiome analysis through formula image-mer based metagenomics. Nucl Acids Res. 2024;10:364. - PMC - PubMed