Int J Mol Sci. 2021 Jul 21;22(15):7773.
doi: 10.3390/ijms22157773.

BonMOLière: Small-Sized Libraries of Readily Purchasable Compounds, Optimized to Produce Genuine Hits in Biological Screens across the Protein Space

Neann Mathai et al. Int J Mol Sci.

Abstract

Experimental screening of large sets of compounds against macromolecular targets is a key strategy to identify novel bioactivities. However, large-scale screening requires substantial experimental resources and is time-consuming and challenging. Therefore, small to medium-sized compound libraries with a high chance of producing genuine hits on an arbitrary protein of interest would be of great value to fields related to early drug discovery, in particular biochemical and cell research. Here, we present a computational approach that incorporates drug-likeness, predicted bioactivities, biological space coverage, and target novelty, to generate optimized compound libraries with maximized chances of producing genuine hits for a wide range of proteins. The computational approach evaluates drug-likeness with a set of established rules, predicts bioactivities with a validated, similarity-based approach, and optimizes the composition of small sets of compounds towards maximum target coverage and novelty. We found that, in comparison to the random selection of compounds for a library, our approach generates substantially improved compound sets. Quantified as the "fitness" of compound libraries, the calculated improvements ranged from +60% (for a library of 15,000 compounds) to +184% (for a library of 1000 compounds). The best of the optimized compound libraries prepared in this work are available for download as a dataset bundle ("BonMOLière").

Keywords: biological screening; evolutionary optimization; genetic algorithms; novel targets; optimized compound library; purchasable compounds; tool compounds.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Overview of the workflow followed to generate optimized compound libraries: (A) source of compounds for the generation of optimized screening libraries, (B) preprocessing of compounds, (C) removal of compounds with undesired properties, (D) target prediction, (E) source of bioactivity data for target prediction, (F) ZINC20 compounds with predicted targets, (G) assignment of Pfam families, (H) genetic algorithm for optimal subset selection, (I) optimal library selected.
Figure 2
Top ten most popular Murcko scaffolds among the pool of candidate compounds. The numbers in the parentheses indicate how many compounds (out of 1,314,755) in the PCC have the scaffold.
Figure 3
Distributions of physicochemical properties observed for the PCC: (A) molecular weight, (B) number of heavy atoms, (C) number of rotatable bonds, (D) number of rings, (E) number of hydrogen bond donors, (F) number of hydrogen bond acceptors, (G) logP, and (H) QED score. Note that for very few compounds the logP value is greater than 4; this is because these logP values are calculated with RDKit (version 2020.09.1.0) [26] and may differ, to some extent, from the calculated logP values provided in the ZINC20 database.
Figure 4
Types of proteins among the 3362 targets predicted for the ZINC20 compounds and the 5170 targets in the ChEMBL27 reference set. The size of the bars reflects the percentage of a target type represented while the labels are the counts of the targets for each type.
Figure 5
Distribution of novelty scores of the targets predicted for the PCC and of all targets found in the ChEMBL27 reference set.
Figure 6
Development of the fittest population over 300 generations of a library of (A) 1000 compounds, (B) 5000 compounds, (C) 10,000 compounds, and (D) 15,000 compounds.
Figure 7
Radar charts visualizing the changes in the properties of the fittest library (solid black lines) for the (A) 1000-compound library, (B) 5000-compound library, (C) 10,000-compound library, and (D) 15,000-compound library, compared with the baseline populations (dashed black lines in each of the diagrams). The fitness value of each library is noted next to the line indicating the properties of that library.
Figure 8
Radar charts comparing the properties of the baseline compound libraries (dashed black lines) with those of the further optimized compound libraries (continuous black lines): (A) the 1000-compound library generated with a population size of 1000 and (B) the 5000-compound library generated with a population size of 5000.
Figure 9
Distribution of the maximum similarities (quantified as Tanimoto coefficient based on Morgan fingerprints with radius 2 and length of 2048 bits) of the compounds derived from the ZINC20 data set to the compounds of the ChEMBL27 reference set for target prediction (derived from the ChEMBL27 database). The line is the kernel density estimate and the bars are the normalized histogram of the similarities. The distribution shows a large number of dissimilar pairs and a long tail as similarity increases. This observation is consistent with existing knowledge that two random compounds are more likely to be dissimilar than similar [33,34]. Of the 2,572,351 compounds on which a similarity search was carried out, nearly half (1,257,596) had a maximum similarity of less than 0.5 to the ChEMBL27 reference set (grey bars). For these compounds, no likely targets could be identified by the computational approach. For the purpose of this study, they were therefore regarded as “dark chemical matter” [35] and, since the aim of this study is to generate compound libraries with the best coverage of the target space, discarded. The remaining 1,314,755 compounds (blue bars) were assigned the targets of their matching ChEMBL27 compounds as predicted targets. These 1,314,755 unique ZINC20 compounds covered 3362 predicted targets and were retained as the pool of candidate compounds (PCC) from which the final, optimized compound libraries were generated with the genetic algorithm. The PCC had a median Tanimoto coefficient of 0.59 to the ChEMBL27 reference set, and 32,032 compounds (2% of the PCC) had the same Morgan fingerprints as compounds in the ChEMBL27 reference set, resulting in the peak at a Tanimoto coefficient of 1.
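The similarity filter described in this caption can be illustrated with a minimal sketch. It assumes fingerprints are represented as sets of on-bit indices; the toy fingerprints, compound names, and helper functions below are hypothetical stand-ins, not real Morgan fingerprints or the authors' code. Only the 0.5 cut-off on the maximum Tanimoto coefficient is taken from the caption.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_similarity(query, reference):
    """Highest Tanimoto coefficient of `query` against any reference fingerprint."""
    return max(tanimoto(query, ref) for ref in reference)


def split_by_threshold(candidates, reference, threshold=0.5):
    """Split candidates into (retained, discarded) by their maximum similarity."""
    retained, discarded = {}, {}
    for name, fingerprint in candidates.items():
        score = max_similarity(fingerprint, reference)
        (retained if score >= threshold else discarded)[name] = score
    return retained, discarded


# Toy fingerprints (hypothetical on-bit indices, not real Morgan fingerprints).
reference = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
candidates = {
    "cand_a": frozenset({1, 2, 3, 4}),  # identical to a reference compound
    "cand_b": frozenset({1, 2, 3, 9}),  # 3 shared bits out of 5 distinct bits
    "cand_c": frozenset({20, 21, 22}),  # shares no bits with the reference set
}
retained, discarded = split_by_threshold(candidates, reference)
```

In this toy setting, `cand_a` and `cand_b` would be kept as candidates and `cand_c` would be treated like the discarded "dark chemical matter" compounds.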
Figure 10
Sum of a geometric progression ($S = \text{scale factor} \cdot \frac{1 - r^{\text{count}}}{1 - r}$) with a scale factor of 1 and varying values of the common ratio ($r$) versus the count. When used to calculate the fitness score, a sum of a geometric progression is calculated for each Pfam family (where the novelty score is set as the scale factor, and the number of times the Pfam family is predicted is set as the count) and summed to get the fitness score (Equation (2)).
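A small sketch may help make the scoring in this caption concrete. It follows the caption's description (one geometric sum per Pfam family, with the novelty score as scale factor and the prediction count as count), but the common ratio of 0.5, the family names, and the novelty values are illustrative assumptions, not values taken from the paper.

```python
def geometric_sum(scale_factor: float, ratio: float, count: int) -> float:
    """Sum of the first `count` terms of a geometric progression."""
    if ratio == 1:
        return scale_factor * count  # limiting case; the closed form divides by zero
    return scale_factor * (1 - ratio ** count) / (1 - ratio)


def fitness(pfam_counts: dict, novelty: dict, ratio: float = 0.5) -> float:
    """Add up one geometric sum per Pfam family (ratio of 0.5 is an assumption)."""
    return sum(
        geometric_sum(novelty[family], ratio, count)
        for family, count in pfam_counts.items()
    )


# Hypothetical library: family "PF_A" is predicted three times, "PF_B" once.
pfam_counts = {"PF_A": 3, "PF_B": 1}
novelty = {"PF_A": 1.0, "PF_B": 0.5}  # made-up novelty scores
score = fitness(pfam_counts, novelty)
```

With a common ratio below 1, every additional compound predicted for an already covered family contributes less than the previous one, so a score of this form favors libraries that spread their predictions across many families.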
Figure 11
Schematic of the genetic algorithm which was implemented to select an optimal subset of compounds for the compound library.
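For orientation, a generic genetic algorithm for subset selection might look like the sketch below. This is not the authors' implementation: the truncation selection, union-based crossover, swap mutation, elitism, all parameter values, and the toy coverage fitness are assumptions chosen only to keep the example short and runnable.

```python
import random


def evolve(pool_size, library_size, fitness, generations=50,
           population_size=20, mutation_rate=0.1, seed=0):
    """Return the fittest library found and the best fitness per generation."""
    rng = random.Random(seed)
    pool = range(pool_size)
    # Each individual is one candidate library: a set of compound indices.
    population = [frozenset(rng.sample(pool, library_size))
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: population_size // 2]  # truncation selection
        children = [ranked[0]]                    # elitism: keep the best library
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: draw the child from the union of both parents.
            genes = rng.sample(sorted(mother | father), library_size)
            # Mutation: swap some compounds for ones not yet in the child.
            for i in range(library_size):
                if rng.random() < mutation_rate:
                    outside = [c for c in pool if c not in genes]
                    genes[i] = rng.choice(outside)
            children.append(frozenset(genes))
        population = children
    best = max(population, key=fitness)
    return best, history + [fitness(best)]


# Toy problem: 40 hypothetical compounds, each assigned one of 10 made-up targets.
target_of = {compound: compound % 10 for compound in range(40)}


def coverage(library):
    """Toy fitness: number of distinct targets hit by the library."""
    return len({target_of[c] for c in library})


best, history = evolve(pool_size=40, library_size=5, fitness=coverage)
```

Because the best library of each generation is carried over unchanged, the best fitness recorded in `history` cannot decrease from one generation to the next.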

References

    1. Macarron R., Banks M.N., Bojanic D., Burns D.J., Cirovic D.A., Garyantes T., Green D.V.S., Hertzberg R.P., Janzen W.P., Paslay J.W., et al. Impact of High-Throughput Screening in Biomedical Research. Nat. Rev. Drug Discov. 2011;10:188–195. doi: 10.1038/nrd3368.
    2. Drewry D.H., Macarron R. Enhancements of Screening Collections to Address Areas of Unmet Medical Need: An Industry Perspective. Curr. Opin. Chem. Biol. 2010;14:289–298. doi: 10.1016/j.cbpa.2010.03.024.
    3. Baell J.B. Broad Coverage of Commercially Available Lead-like Screening Space with Fewer than 350,000 Compounds. J. Chem. Inf. Model. 2013;53:39–55. doi: 10.1021/ci300461a.
    4. Paricharak S., Méndez-Lucio O., Chavan Ravindranath A., Bender A., IJzerman A.P., van Westen G.J.P. Data-Driven Approaches Used for Compound Library Design, Hit Triage and Bioactivity Modeling in High-Throughput Screening. Brief Bioinform. 2018;19:277–285. doi: 10.1093/bib/bbw105.
    5. Wassermann A.M., Camargo L.M., Auld D.S. Composition and Applications of Focus Libraries to Phenotypic Assays. Front. Pharmacol. 2014;5:164. doi: 10.3389/fphar.2014.00164.
