Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval
- PMID: 17967006
- PMCID: PMC2536658
- DOI: 10.1021/ci700200n
Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval
Abstract
Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless, compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrival performance, as shown on six benchmark data sets of druglike molecules.
Figures









Similar articles
-
Mathematical correction for fingerprint similarity measures to improve chemical retrieval.J Chem Inf Model. 2007 May-Jun;47(3):952-64. doi: 10.1021/ci600526a. Epub 2007 Apr 20. J Chem Inf Model. 2007. PMID: 17444629
-
Accelerating chemical database searching using graphics processing units.J Chem Inf Model. 2011 Aug 22;51(8):1807-16. doi: 10.1021/ci200164g. Epub 2011 Jul 13. J Chem Inf Model. 2011. PMID: 21696144
-
Speeding up chemical database searches using a proximity filter based on the logical exclusive or.J Chem Inf Model. 2008 Jul;48(7):1367-78. doi: 10.1021/ci800076s. Epub 2008 Jul 2. J Chem Inf Model. 2008. PMID: 18593143
-
Engineering Aspects of Olfaction.In: Persaud KC, Marco S, Gutiérrez-Gálvez A, editors. Neuromorphic Olfaction. Boca Raton (FL): CRC Press/Taylor & Francis; 2013. Chapter 1. In: Persaud KC, Marco S, Gutiérrez-Gálvez A, editors. Neuromorphic Olfaction. Boca Raton (FL): CRC Press/Taylor & Francis; 2013. Chapter 1. PMID: 26042329 Free Books & Documents. Review.
-
Displaying radiologic images on personal computers: image storage and compression--Part 2.J Digit Imaging. 1994 Feb;7(1):1-12. doi: 10.1007/BF03168473. J Digit Imaging. 1994. PMID: 8172973 Review.
Cited by
-
Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods.J Chem Inf Model. 2012 Apr 23;52(4):891-900. doi: 10.1021/ci200552r. Epub 2012 Apr 10. J Chem Inf Model. 2012. PMID: 22462644 Free PMC article.
-
Computer-Assisted Retrosynthesis Based on Molecular Similarity.ACS Cent Sci. 2017 Dec 27;3(12):1237-1245. doi: 10.1021/acscentsci.7b00355. Epub 2017 Nov 16. ACS Cent Sci. 2017. PMID: 29296663 Free PMC article.
-
Accurate and efficient target prediction using a potency-sensitive influence-relevance voter.J Cheminform. 2015 Dec 29;7:63. doi: 10.1186/s13321-015-0110-6. eCollection 2015. J Cheminform. 2015. PMID: 26719774 Free PMC article.
-
Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress.J Chem Inf Model. 2011 Apr 25;51(4):760-76. doi: 10.1021/ci100223t. Epub 2011 Mar 18. J Chem Inf Model. 2011. PMID: 21417267 Free PMC article. Review.
-
Hashing algorithms and data structures for rapid searches of fingerprint vectors.J Chem Inf Model. 2010 Aug 23;50(8):1358-68. doi: 10.1021/ci100132g. J Chem Inf Model. 2010. PMID: 20681581 Free PMC article.
References
-
- Fligner MA, Verducci JS, Blower PE. A Modification of the Jaccard/Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics. 2002;44:110–119.
-
- Flower DR. On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Science. 1998;38:378–386.
-
- James CA, Weininger D, Delany J. Daylight Theory Manual. 2004. Current 2007 version available at http://www.daylight.com/dayhtml/doc/theory/index.html.
-
- Xue L, Godden JF, Stahura FL, Bajorath J. Profile scaling increases the similarity search performance of molecular fingerprints containing numerical descriptors and structural keys. Journal of Chemical Information and Computer Sciences. 2003;43:1218–1225. - PubMed
-
- Xue L, Stahura FL, Bajorath J. Similarity search profiling reveals effects of fingerprint scaling in virtual screening. Journal of Chemical Information and Computer Sciences. 2004;44:2032–2039. - PubMed
Publication types
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources