. 2010 Aug 23;50(8):1358-68.

doi: 10.1021/ci100132g.

Hashing algorithms and data structures for rapid searches of fingerprint vectors

Ramzi Nasr¹, Daniel S Hirschberg, Pierre Baldi

Affiliations

Affiliation

¹ School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA.

PMID: 20681581
PMCID: PMC2926297
DOI: 10.1021/ci100132g

Hashing algorithms and data structures for rapid searches of fingerprint vectors

Ramzi Nasr et al. J Chem Inf Model. 2010.

. 2010 Aug 23;50(8):1358-68.

doi: 10.1021/ci100132g.

Authors

Ramzi Nasr¹, Daniel S Hirschberg, Pierre Baldi

Affiliation

¹ School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA.

PMID: 20681581
PMCID: PMC2926297
DOI: 10.1021/ci100132g

Abstract

In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence of particular functional groups or combinatorial features. To speed up database searches, we propose to add to each fingerprint a short signature integer vector of length M. For a given fingerprint, the i component of the signature vector counts the number of 1-bits in the fingerprint that fall on components congruent to i modulo M. Given two signatures, we show how one can rapidly compute a bound on the Jaccard-Tanimoto similarity measure of the two corresponding fingerprints, using the intersection bound. Thus, these signatures allow one to significantly prune the search space by discarding molecules associated with unfavorable bounds. Analytical methods are developed to predict the resulting amount of pruning as a function of M. Data structures combining different values of M are also developed together with methods for predicting the optimal values of M for a given implementation. Simulations using a particular implementation show that the proposed approach leads to a 1 order of magnitude speedup over a linear search and a 3-fold speedup over a previous implementation. All theoretical results and predictions are corroborated by large-scale simulations using molecules from the ChemDB. Several possible algorithmic extensions are discussed.

PubMed Disclaimer

Figures

**Figure 1**
An illustration of modulo hashing at M = 1, 2 and 3. For demonstration purposes, a small 6-bit fingerprint A⃗ = (0, 1, 0, 1, 1, 0) is used.

**Figure 2**
The solid curve represents the empirical fraction of pruning measured over different Tanimoto thresholds with the M = 2 hashing approach. A sample of 100 query fingerprints with A = 205 are selected from ChemDB and run against a random sample of 100,000 fingerprints from ChemDB. The dotted curve shows the analytically predicted fraction of pruning (Equation 14) given A = 205, the mean (205) and standard deviation (97.9) of the number B of 1-bits for the fingerprints in the background sample, and the threshold t.

**Figure 3**
The distribution of the number $a_{i}^{M}$ of 1-bits in a generic bin i for M = 16, 32, 64, and 128, for ChemDB fingerprints with A = 205. The black curve shows the Normal approximation with parameters estimated using the Hypergeometric model (Equations 33 and 34). For larger values of M in the two lower figures, the dashed black line shows how the Poisson distribution provides a better approximation of the empirical data.

**Figure 4**
Top left: the empirical means (black dots) and standard deviations (black error bars) of the intersection bound $S = \sum_{i = 1}^{M} min (a_{i}^{M}, b_{i}^{M})$ as a function of M. The results correspond to 100 query molecules with A = 205 against a sample of 100,000 molecules from ChemDB. The red and green curves show the analytical predictions using the Normal and the Poisson approximations for $a_{i}^{M}$ respectively. Top right: analytical and empirical (dotted lines) amount of pruning for M = 256 and M = 512. Bottom left: empirical distribution of S for M = 256 (magenta) with its Normal approximation. Bottom right: empirical distribution of S for M = 512 (blue) with its Normal approximation.

**Figure 5**
Left: curves showing the average fraction of pruning for different thresholds t and different values of M. The results are obtained by averaging runs of 100 random query molecules against a background of 100,000 molecules from ChemDB. Color legend from right to left: black corresponds to M = 1, red corresponds to M = 2, dashed curve to recursive hashing (see text), cyan to M = 4, magenta to M = 8, yellow to M = 16, green to M = 32, red to M = 64, cyan to M = 128, magenta to M = 256, blue to M = 512. Predicted curves obtained using Equation 28 are shown with dots of the corresponding color for M = 256 and M = 512. Right: similar results and colors, but displaying the fraction of molecules eliminated for different values of M, relative to M = 1.

**Figure 6**
Average amount of pruning across all possible thresholds (t) obtained empirically from 100 random query molecules used to search a random sample of 100,000 ChemDB molecules using fingerprints with lossless compression.

**Figure 7**
Left: the quantity Q = (1 − P₂) × *T_M* + (1 − *P_M*) × *T_Tanimoto* is plotted as a function of M to examine at which M the minimum occurs. Right: the plot compares empirical search times (red) to the approximation formula of Equation 16 with ϕ = 0.32 (blue). The empirical results are obtained by averaging search times across all thresholds t for 100 random query molecules searched against a random sample of 100,000 ChemDB fingerprints encoded with lossless compression.

**Figure 8**
The curves shows the improvement in time over a linear search of the database. The timing results correspond to the average of 100 random ChemDB molecules queried against a random background of 100,000 molecules using a combination of M = 1 hashing with the XOR approach (blue) and the hashing approach with M = 2 followed by M = 256 (red). Note that for very small thresholds (t ≤ 0.2) of little general interest, search time is worse than linear (which corresponds to the dotted line). This is because the amount of pruning is negligible and therefore the extra computational cost associated with the signatures becomes significant.

**Figure 9**
Left: an example of recursive hashing of a small 16-bit fingerprint. Right: the location in the tree data structure where the fingerprint is stored.

See this image and copyright information in PMC

Cited by

The chemfp project.
Dalke A. Dalke A. J Cheminform. 2019 Dec 5;11(1):76. doi: 10.1186/s13321-019-0398-8. J Cheminform. 2019. PMID: 33430977 Free PMC article.
Methods for Similarity-based Virtual Screening.
Kristensen TG, Nielsen J, Pedersen CN. Kristensen TG, et al. Comput Struct Biotechnol J. 2013 Mar 3;5:e201302009. doi: 10.5936/csbj.201302009. eCollection 2013. Comput Struct Biotechnol J. 2013. PMID: 24688702 Free PMC article. Review.
Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods.
Nasr R, Vernica R, Li C, Baldi P. Nasr R, et al. J Chem Inf Model. 2012 Apr 23;52(4):891-900. doi: 10.1021/ci200552r. Epub 2012 Apr 10. J Chem Inf Model. 2012. PMID: 22462644 Free PMC article.

References

1. Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics. 2005;21:4133–4139. - PubMed
1. Irwin JJ, Shoichet BK. ZINC–A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Comput. Sci. 2005;45:177–182. - PMC - PubMed
1. Chen J, Linstead E, Swamidass SJ, Wang D, Baldi P. ChemDB Update–Full Text Search and Virtual Chemical Space. Bioinformatics. 2007;23:2348–2351. - PubMed
1. Wang Y, Xiao J, Suzek T, Zhang J, Wang J, Bryant S. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37:W623–W633. - PMC - PubMed
1. Sayers E, Barrett T, Benson D, Bolton E, Bryant S, Canese K, Chetvernin V, Church D, DiCuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010;38:D5–D16. - PMC - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

T15 LM007443/LM/NLM NIH HHS/United States

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Hashing algorithms and data structures for rapid searches of fingerprint vectors

Affiliation

Hashing algorithms and data structures for rapid searches of fingerprint vectors

Authors

Affiliation

Abstract

Figures

Similar articles

Cited by

References

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources