Reoptimization of MDL keys for use in drug discovery
- PMID: 12444722
- DOI: 10.1021/ci010132r
Abstract
For a number of years MDL products have exposed both 166 bit and 960 bit keysets based on 2D descriptors. These keysets were originally constructed and optimized for substructure searching. We report on improvements in the performance of MDL keysets which are reoptimized for use in molecular similarity. Classification performance for a test data set of 957 compounds was increased from 0.65 for the 166 bit keyset and 0.67 for the 960 bit keyset to 0.71 for a surprisal S/N pruned keyset containing 208 bits and 0.71 for a genetic algorithm optimized keyset containing 548 bits. We present an overview of the underlying technology supporting the definition of descriptors and the encoding of these descriptors into keysets. This technology allows definition of descriptors as combinations of atom properties, bond properties, and atomic neighborhoods at various topological separations as well as supporting a number of custom descriptors. These descriptors can then be used to set one or more bits in a keyset. We constructed various keysets and optimized their performance in clustering bioactive substances. Performance was measured using methodology developed by Briem and Lessel. "Directed pruning" was carried out by eliminating bits from the keysets on the basis of random selection, values of the surprisal of the bit, or values of the surprisal S/N ratio of the bit. The random pruning experiment highlighted the insensitivity of keyset performance for keyset lengths of more than 1000 bits. Contrary to initial expectations, pruning on the basis of the surprisal values of the various bits resulted in keysets which underperformed those resulting from random pruning. In contrast, pruning on the basis of the surprisal S/N ratio was found to yield keysets which performed better than those resulting from random pruning. We also explored the use of genetic algorithms in the selection of optimal keysets. 
Once more the performance was only a weak function of keyset size, and the optimizations failed to identify a single globally optimal keyset. Instead multiple, equally optimal keysets could be produced which had relatively low overlap of the descriptors they encoded.
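The pruning experiments described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions the abstract does not state: a keyset is represented as a set of on-bit indices per compound, the surprisal of a bit is taken as -log2(p), where p is the fraction of compounds that set the bit, and bits are ranked with the highest score kept. The abstract does not define the surprisal S/N ratio, so it is not computed here; any per-bit score can be passed to the hypothetical helper `prune_by_score`.

```python
import math
import random


def bit_surprisal(fingerprints, n_bits):
    """Return -log2(p) per bit, where p is the fraction of compounds setting it.

    Assumption: this standard information-theoretic definition stands in for
    the paper's surprisal, which the abstract does not spell out.
    """
    n = len(fingerprints)
    surprisal = []
    for bit in range(n_bits):
        count = sum(1 for fp in fingerprints if bit in fp)
        p = count / n
        # A bit that is never set has unbounded surprisal.
        surprisal.append(math.inf if p == 0 else -math.log2(p))
    return surprisal


def prune_random(n_bits, keep, seed=0):
    """Random pruning baseline: keep `keep` bit positions chosen at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_bits), keep))


def prune_by_score(scores, keep, highest=True):
    """Directed pruning: keep the `keep` bits with the best per-bit score.

    Assumption: the ranking direction is a free choice here; the abstract
    does not say which end of the surprisal scale was eliminated.
    """
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=highest)
    return sorted(ranked[:keep])


# Toy data: four compounds over a four-bit keyset (illustrative only).
fps = [{0, 1}, {0, 2}, {0, 1, 3}, {0}]
scores = bit_surprisal(fps, 4)
kept_directed = prune_by_score(scores, keep=2)
kept_random = prune_random(4, keep=2)
```

Each pruned keyset would then be scored with a classification measure such as the Briem and Lessel methodology named in the abstract, which is outside the scope of this sketch.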