ntHash2: recursive spaced seed hashing for nucleotide sequences

Parham Kazemi^{1

2}, Johnathan Wong¹, Vladimir Nikolić¹, Hamid Mohamadi³, René L Warren¹, Inanç Birol^{1

4}

Affiliations

¹ Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada.
² Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
³ Amazon Web Services Inc., Seattle, WA 98109, USA.
⁴ Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

PMID: 36000872
PMCID: PMC9563681
DOI: 10.1093/bioinformatics/btac564

ntHash2: recursive spaced seed hashing for nucleotide sequences

Parham Kazemi et al. Bioinformatics. 2022.

. 2022 Oct 14;38(20):4812-4813.

doi: 10.1093/bioinformatics/btac564.

Authors

Parham Kazemi^{1

2}, Johnathan Wong¹, Vladimir Nikolić¹, Hamid Mohamadi³, René L Warren¹, Inanç Birol^{1

4}

Affiliations

¹ Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada.
² Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
³ Amazon Web Services Inc., Seattle, WA 98109, USA.
⁴ Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

PMID: 36000872
PMCID: PMC9563681
DOI: 10.1093/bioinformatics/btac564

Abstract

Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research.

Results: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.

Availability and implementation: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

**Fig. 1.**
(a) Linear increase of the wall clock time required by ntHash2 to generate spaced seed hashes (Seeds 1–6) from 1 million random 1 kbp sequences (one hash value per seed). Hashing spaced seeds with more blocks and monomers takes more time. (b) Histogram of a million $k - mer hashes generated by ntHash 2 from random 100 - mers . Hash values (H$ ) are distributed uniformly in the normalized 64-bit word space (x-axis). The mean and standard deviation of the bin counts are 1000 ± 31.29, which is close to the ideal value of 1000 hashes per bin. (c) Average wall clock time elapsed by ntHash2 and similar hashing algorithms on a unique dataset (10⁶ random 1 kbp sequences) over 3 runs. Standard deviation was negligible (<500 ms for all tools). Spaced seed patterns (Seed1–Seed6) are described in Supplementary Section S6

See this image and copyright information in PMC

References

1. Chakravarti I.M. et al. (1967) Handbook of Methods of Applied Statistics. Vol. 1. John Wiley and Sons, New York, pp. 392–394.
1. Chu J. et al. (2020) Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex bloom filters. Proc. Natl. Acad. Sci. USA, 117, 16961–16968. - PMC - PubMed
1. Ma B. et al. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445. - PubMed
1. Mohamadi H. et al. (2016) ntHash: recursive nucleotide hashing. Bioinformatics, 32, 3492–3494. - PMC - PubMed
1. Petrucci E. et al. (2020) Iterative spaced seed hashing: closing the gap between spaced seed hashing and k-mer Hashing. J. Comput. Biol., 27, 223–233. - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

ntHash2: recursive spaced seed hashing for nucleotide sequences

Affiliations

ntHash2: recursive spaced seed hashing for nucleotide sequences

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Miscellaneous