Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval

Pierre Baldi¹, Ryan W Benz, Daniel S Hirschberg, S Joshua Swamidass

Affiliations

Affiliation

¹ Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California-Irvine, Irvine, CA 92697-3435, USA. pfbaldi@ics.uci.edu

PMID: 17967006
PMCID: PMC2536658
DOI: 10.1021/ci700200n

Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval

Pierre Baldi et al. J Chem Inf Model. 2007 Nov-Dec.

. 2007 Nov-Dec;47(6):2098-109.

doi: 10.1021/ci700200n. Epub 2007 Oct 30.

Authors

Pierre Baldi¹, Ryan W Benz, Daniel S Hirschberg, S Joshua Swamidass

Affiliation

¹ Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California-Irvine, Irvine, CA 92697-3435, USA. pfbaldi@ics.uci.edu

PMID: 17967006
PMCID: PMC2536658
DOI: 10.1021/ci700200n

Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless, compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrival performance, as shown on six benchmark data sets of druglike molecules.

PubMed Disclaimer

Figures

**Figure 1**
A molecule represented as a labeled graph. The labels on the vertices correspond to atom symbols and those on the edges describe the type of covalent bond between atoms (e.g. ‘s’ for single bond, ‘d’ for double bond). Also shown are examples of labeled paths of length 1 and 2 resulting from a depth-first search exploration of the graph, starting from one of the carbon atoms.

**Figure 2**
Fingerprint density as a function of fingerprint length. Logarithms on both axes are taken to base 2. The average density is captured by p, the probability of a 1-bit in long fingerprint vectors computed across all the molecules. Blue lines correspond to path features with Element labeling. Red lines correspond to circular substructure features, with Extended Connectivity labeling.

**Figure 3**
Power law distributions for paths of length 0−8 and circular substructures of depth 0−2. The x axis corresponds to the feature's rank and the y axis to the feature's probability, using logarithmic scales on both axes.

**Figure 4**
Illustration of the folding process with a binary vector of length N_* = 16 (1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 ) folded into a binary vector of length N = 4 (1 0 0 1), modulo 4. Note how information in the first position of the compressed vector is lost due to clashes.

**Figure 5**
Monotone Value Coding (MOV).The principle is illustrated using the index representation vector (1, 2, 3, 9, 14, 26, 29). Each integer j is converted to a binary representation of length ⌊log j⌋ which begins with a 1-bit. 0-bits are used between two consecutive integers only when the length (scale) increases. The number of 0-bits is equal to the increase in the length.

**Figure 6**
Monotone Length Coding (MOL). The principle is illustrated using the run-length vector (0, 0, 0, 5, 4, 11, 2) associated with the index vector of Figure 5. Each integer j, except for the initial 0's, is converted to a binary representation of length ⌊log j⌋ + 1 which begins with a 1-bit. In addition, a 1-bit is used between two consecutive integers when the scale does not increase. 0-bits are used between two consecutive integers only when the length (scale) increases. The number of 0-bits is equal to the increase in the scale. The three initial 0-bits are associated with a scale of 0 leading to the three initial 1-bits in the encoding, followed by three 0-bits to denote the increase in scale from 0 to 3 bits.

**Figure 7**
Compression results for Golomb-Rice (red), MOV Elias Gamma for indices (green), and MOL Elias Gamma for run-lengths (blue) encoding schemes applied to binary fingerprints based on paths. Posthash refers to the option where features that are 0 for all the molecules in the database are removed. Sorted refers to the option where features are sorted in decreasing order of frequency across the molecules in the database. Curves represent the average number of bits required per molecule as a function of uncompressed fingerprint size (log *N_hash*). The entropy curve corresponds to the approximate Shannon entropy limit - Σ_i[*p_i* log *p_i* + (1 - *p_i*) log(1 - *p_i*)], derived under the independent component approximation. Plots derived using a random subset of 50,000 molecules from the ChemDB.

**Figure 8**
Compression results for Golomb-Rice (red), MOV Elias Gamma for indices (green), and MOL Elias Gamma for run-lengths (blue) encoding schemes applied to binary fingerprints based on circular substructures. Posthash refers to the option where features that are 0 for all the molecules in the database are removed. Sorted refers to the option where features are sorted in decreasing order of frequency across the molecules in the database. Curves represent the average number of bits required per molecule as a function of uncompressed fingerprint size (log *N_hash*). The entropy curve corresponds to the approximate Shannon entropy limit - Σ_i[*p_i* log *p_i* + (1 - *p_i*) log(1 - *p_i*)], derived under the independent component approximation. Plots derived using a random subset of 50,000 molecules from the ChemDB.

**Figure 9**
ROC retrieval curved based on Tanimoto similarity measures, computed from lossy and lossless compressed fingerprints. Curves are obtained using molecules from six biologically relevant datasets using leave-one-out cross validation. Each ROC curve is constructed by aggregating the ROC curves calculated by using each molecule in the group to search for the rest of the group against the background provided by the random subset of 50,000 molecules from the ChemDB. Lossless compression leads to better retrieval performance corresponding, for instance, to an average increase of 18% for the area under the ROC curves (AUC measure). Lossy fingerprints are derived by modulo compression to 512 bits.

See this image and copyright information in PMC

References

1. Fligner MA, Verducci JS, Blower PE. A Modification of the Jaccard/Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics. 2002;44:110–119.
1. Flower DR. On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Science. 1998;38:378–386.
1. James CA, Weininger D, Delany J. Daylight Theory Manual. 2004. Current 2007 version available at http://www.daylight.com/dayhtml/doc/theory/index.html.
1. Xue L, Godden JF, Stahura FL, Bajorath J. Profile scaling increases the similarity search performance of molecular fingerprints containing numerical descriptors and structural keys. Journal of Chemical Information and Computer Sciences. 2003;43:1218–1225. - PubMed
1. Xue L, Stahura FL, Bajorath J. Similarity search profiling reveals effects of fingerprint scaling in virtual screening. Journal of Chemical Information and Computer Sciences. 2004;44:2032–2039. - PubMed

Publication types

Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval

Affiliation

Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources