Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Oct 14;38(20):4812-4813.
doi: 10.1093/bioinformatics/btac564.

ntHash2: recursive spaced seed hashing for nucleotide sequences

Affiliations

ntHash2: recursive spaced seed hashing for nucleotide sequences

Parham Kazemi et al. Bioinformatics. .

Abstract

Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research.

Results: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.

Availability and implementation: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
(a) Linear increase of the wall clock time required by ntHash2 to generate spaced seed hashes (Seeds 1–6) from 1 million random 1 kbp sequences (one hash value per seed). Hashing spaced seeds with more blocks and monomers takes more time. (b) Histogram of a million k-mer hashes generated by ntHash2 from random 100-mers. Hash values (H) are distributed uniformly in the normalized 64-bit word space (x-axis). The mean and standard deviation of the bin counts are 1000 ± 31.29, which is close to the ideal value of 1000 hashes per bin. (c) Average wall clock time elapsed by ntHash2 and similar hashing algorithms on a unique dataset (106 random 1 kbp sequences) over 3 runs. Standard deviation was negligible (<500 ms for all tools). Spaced seed patterns (Seed1–Seed6) are described in Supplementary Section S6

References

    1. Chakravarti I.M. et al. (1967) Handbook of Methods of Applied Statistics. Vol. 1. John Wiley and Sons, New York, pp. 392–394.
    1. Chu J. et al. (2020) Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex bloom filters. Proc. Natl. Acad. Sci. USA, 117, 16961–16968. - PMC - PubMed
    1. Ma B. et al. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445. - PubMed
    1. Mohamadi H. et al. (2016) ntHash: recursive nucleotide hashing. Bioinformatics, 32, 3492–3494. - PMC - PubMed
    1. Petrucci E. et al. (2020) Iterative spaced seed hashing: closing the gap between spaced seed hashing and k-mer Hashing. J. Comput. Biol., 27, 223–233. - PubMed

Publication types