Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2010 Aug 23;50(8):1358-68.
doi: 10.1021/ci100132g.

Hashing algorithms and data structures for rapid searches of fingerprint vectors

Affiliations

Hashing algorithms and data structures for rapid searches of fingerprint vectors

Ramzi Nasr et al. J Chem Inf Model. .

Abstract

In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence of particular functional groups or combinatorial features. To speed up database searches, we propose to add to each fingerprint a short signature integer vector of length M. For a given fingerprint, the i component of the signature vector counts the number of 1-bits in the fingerprint that fall on components congruent to i modulo M. Given two signatures, we show how one can rapidly compute a bound on the Jaccard-Tanimoto similarity measure of the two corresponding fingerprints, using the intersection bound. Thus, these signatures allow one to significantly prune the search space by discarding molecules associated with unfavorable bounds. Analytical methods are developed to predict the resulting amount of pruning as a function of M. Data structures combining different values of M are also developed together with methods for predicting the optimal values of M for a given implementation. Simulations using a particular implementation show that the proposed approach leads to a 1 order of magnitude speedup over a linear search and a 3-fold speedup over a previous implementation. All theoretical results and predictions are corroborated by large-scale simulations using molecules from the ChemDB. Several possible algorithmic extensions are discussed.

PubMed Disclaimer

Figures

Figure 1
Figure 1
An illustration of modulo hashing at M = 1, 2 and 3. For demonstration purposes, a small 6-bit fingerprint A⃗ = (0, 1, 0, 1, 1, 0) is used.
Figure 2
Figure 2
The solid curve represents the empirical fraction of pruning measured over different Tanimoto thresholds with the M = 2 hashing approach. A sample of 100 query fingerprints with A = 205 are selected from ChemDB and run against a random sample of 100,000 fingerprints from ChemDB. The dotted curve shows the analytically predicted fraction of pruning (Equation 14) given A = 205, the mean (205) and standard deviation (97.9) of the number B of 1-bits for the fingerprints in the background sample, and the threshold t.
Figure 3
Figure 3
The distribution of the number aiM of 1-bits in a generic bin i for M = 16, 32, 64, and 128, for ChemDB fingerprints with A = 205. The black curve shows the Normal approximation with parameters estimated using the Hypergeometric model (Equations 33 and 34). For larger values of M in the two lower figures, the dashed black line shows how the Poisson distribution provides a better approximation of the empirical data.
Figure 4
Figure 4
Top left: the empirical means (black dots) and standard deviations (black error bars) of the intersection bound S=i=1Mmin(aiM,biM) as a function of M. The results correspond to 100 query molecules with A = 205 against a sample of 100,000 molecules from ChemDB. The red and green curves show the analytical predictions using the Normal and the Poisson approximations for aiM respectively. Top right: analytical and empirical (dotted lines) amount of pruning for M = 256 and M = 512. Bottom left: empirical distribution of S for M = 256 (magenta) with its Normal approximation. Bottom right: empirical distribution of S for M = 512 (blue) with its Normal approximation.
Figure 5
Figure 5
Left: curves showing the average fraction of pruning for different thresholds t and different values of M. The results are obtained by averaging runs of 100 random query molecules against a background of 100,000 molecules from ChemDB. Color legend from right to left: black corresponds to M = 1, red corresponds to M = 2, dashed curve to recursive hashing (see text), cyan to M = 4, magenta to M = 8, yellow to M = 16, green to M = 32, red to M = 64, cyan to M = 128, magenta to M = 256, blue to M = 512. Predicted curves obtained using Equation 28 are shown with dots of the corresponding color for M = 256 and M = 512. Right: similar results and colors, but displaying the fraction of molecules eliminated for different values of M, relative to M = 1.
Figure 6
Figure 6
Average amount of pruning across all possible thresholds (t) obtained empirically from 100 random query molecules used to search a random sample of 100,000 ChemDB molecules using fingerprints with lossless compression.
Figure 7
Figure 7
Left: the quantity Q = (1 − P2) × TM + (1 − PM) × TTanimoto is plotted as a function of M to examine at which M the minimum occurs. Right: the plot compares empirical search times (red) to the approximation formula of Equation 16 with ϕ = 0.32 (blue). The empirical results are obtained by averaging search times across all thresholds t for 100 random query molecules searched against a random sample of 100,000 ChemDB fingerprints encoded with lossless compression.
Figure 8
Figure 8
The curves shows the improvement in time over a linear search of the database. The timing results correspond to the average of 100 random ChemDB molecules queried against a random background of 100,000 molecules using a combination of M = 1 hashing with the XOR approach (blue) and the hashing approach with M = 2 followed by M = 256 (red). Note that for very small thresholds (t ≤ 0.2) of little general interest, search time is worse than linear (which corresponds to the dotted line). This is because the amount of pruning is negligible and therefore the extra computational cost associated with the signatures becomes significant.
Figure 9
Figure 9
Left: an example of recursive hashing of a small 16-bit fingerprint. Right: the location in the tree data structure where the fingerprint is stored.

Similar articles

Cited by

References

    1. Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics. 2005;21:4133–4139. - PubMed
    1. Irwin JJ, Shoichet BK. ZINC–A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Comput. Sci. 2005;45:177–182. - PMC - PubMed
    1. Chen J, Linstead E, Swamidass SJ, Wang D, Baldi P. ChemDB Update–Full Text Search and Virtual Chemical Space. Bioinformatics. 2007;23:2348–2351. - PubMed
    1. Wang Y, Xiao J, Suzek T, Zhang J, Wang J, Bryant S. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37:W623–W633. - PMC - PubMed
    1. Sayers E, Barrett T, Benson D, Bolton E, Bryant S, Canese K, Chetvernin V, Church D, DiCuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010;38:D5–D16. - PMC - PubMed