Review
Drug Discov Today. 2022 Nov;27(11):103351.
doi: 10.1016/j.drudis.2022.103351. Epub 2022 Sep 9.

Combining DELs and machine learning for toxicology prediction

Vincent Blay et al. Drug Discov Today. 2022 Nov.

Abstract

DNA-encoded libraries (DELs) allow starting chemical matter to be identified in drug discovery. The volume of experimental data generated also makes DELs an attractive resource for machine learning (ML). ML allows modeling complex relationships between compounds and numerical endpoints, such as the binding to a target measured by DELs. DELs could also empower other areas of drug discovery. Here, we propose that DELs and ML could be combined to model binding to off-targets, enabling better predictive toxicology. With enough data, ML models can make accurate predictions across a vast chemical space, and they can be reused and expanded across projects. Although there are limitations, more general toxicology models could be applied earlier during drug discovery, illuminating safety liabilities at a lower cost.

Keywords: Cheminformatics; DNA-encoded libraries; Deep learning; Toxicology; Safety pharmacology; Machine learning.

Conflict of interest statement

Declaration of Competing Interest: The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: S.E. is the owner, and J.G. and F.U. are employees, of Collaborations Pharmaceuticals, Inc.

Figures

Figure 1.
Multi-task ML is particularly well suited to modeling multiple toxicology endpoints. A ConvLSTM model was built to predict 42 toxicity endpoints (IC50 values) based on in vitro data. (A) Estimated time for fingerprint calculation for one billion molecules using RDKit's GetMorganFingerprintAsBitVect function compared with the custom SMILES tokenization pre-processing used in our ConvLSTM model. (B) Parity plot for 42 mixed targets, showing predicted vs. actual -log(Molar) values for a test set of compounds, with at least 10 datapoints for each of the 42 toxicity targets (see Table S1). RMSE and R² are shown for the combined predictions. (C) RMSE vs. size of the training data for each target. The targets with the highest RMSE (CCK1, red) and the lowest RMSE (serotonin 5-HT1B, blue) are highlighted, along with the toxicity target with the largest training set (hERG, pink). (D) t-SNE plot of chemical space (input: Morgan fingerprints of radius 3, 1024 bits) showing the overlap of the DOS-DEL-1 set and the combined training sets of the 42 tox targets used to build the model.
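As a rough illustration of the two summary statistics reported in panel B, the sketch below computes RMSE and R² over pooled predicted vs. actual -log(Molar) values. The caption does not say which definition of R² was used; this sketch assumes the coefficient of determination (1 minus residual over total sum of squares), and the function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination (assumed definition, see lead-in)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical pooled -log(Molar) values, not data from the paper.
    actual = [5.2, 6.8, 7.1, 4.9, 6.0]
    predicted = [5.0, 6.5, 7.4, 5.3, 6.1]
    print(f"RMSE = {rmse(actual, predicted):.3f}")
    print(f"R^2  = {r_squared(actual, predicted):.3f}")
```

Pooling all targets into one statistic, as in panel B, can hide per-target differences, which is why panel C reports RMSE per target.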
Figure 2.
Combining DELs and ML may provide novel endpoints for predictive toxicology. For instance, the compression of multiple targets in the same DEL could enable a cost-efficient screening for promiscuity. Native proteins can be extracted from human organs, tissues, or cells. These proteins can be used in-solution (A) or be immobilized (B) for DEL selections. (A) The protein extract may be incubated with the DEL. After incubation, a chemically reactive DNA probe (capture probe; a photo-crosslinker diazirine is shown as an example), which is complementary to the common primer-binding site of the library, is added. UV irradiation then triggers the covalent capture of the target and a primer extension step copies the DNA code. The protein-DNA conjugates may be purified by protein extraction or using a built-in biotin group in the capture probe, followed by PCR amplification and DNA sequencing. (B) The proteins are immobilized on beads, and the protein-coated beads are incubated with the DEL. After careful washing of non-binders, bound library members are eluted, PCR-amplified, and sequenced. In both formats, the DNA sequences confidently map to the chemical identity of the compounds. Given the diversity of proteins, the bound compounds identified from the DEL likely represent promiscuous binders for that specific protein mix (see Figure 4). The large amounts of data generated from DELs can be modeled using ML. Different endpoints (e.g., promiscuity across different targets, cell types or tissues) can be modeled simultaneously in a multi-task ML model. The model can then be used to inspect large chemical libraries and remove or filter compounds with potential safety liabilities early in the drug discovery pipeline.
Figure 3.
DELs might provide novel opportunities for generating data and modeling key ADMET endpoints, such as compound promiscuity and cell permeability. DELs could be incubated with liposomes. The DNA tag may be a significant cargo, and this may allow identifying cell-penetrating compounds from the liposomes (left). Alternatively, a hydrophobic linker might be used to reduce the impact of the DNA tag on the permeability of the small-molecule head (right). This could allow a better assessment of permeability by measuring what fraction of each library member is retained inside the liposomes or on the membrane.
Figure 4.
Relationships between compound recovery, target concentration ([P]total), individual ligand concentration ([L]total), and binding affinity (Kd) in DEL selections. [P]total, [L]total, and Kd have the same units and are displayed in logarithmic concentration units. The simulation considers the recovery achieved after a single ideal equilibration step, with a simple association-dissociation equilibrium (see Supporting Information for details). Since a DEL assay involves washing steps, we only expect compounds with high recoveries (>90%) to be identified as positives. The simulation indicates that the total protein concentration should be set considerably higher than that of the individual ligands to achieve a high recovery of tight binders (A, C). The results also indicate that, if multiple ligands can compete for the same active site, the total target concentration should be higher than the sum of all of them to enable a high recovery. In other words, the protein concentration affects the stringency of the recovery, such that the lower the protein concentration, the higher the binding affinity of the compound will have to be for it to be positively observed in the DEL (B). As a first approximation, only compounds with Kd ≪ [P]total are expected to be recovered.
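The dependence of recovery on [P]total, [L]total, and Kd can be sketched for the simplest case of a single ligand binding a single site (P + L ⇌ PL). This is a minimal illustration under that assumption, not the authors' simulation code, which is described in their Supporting Information; the function name and the example concentrations are hypothetical. The bound concentration [PL] is the smaller root of [PL]² − ([P]total + [L]total + Kd)[PL] + [P]total[L]total = 0, and recovery is taken as [PL]/[L]total.

```python
import math


def recovery(p_total: float, l_total: float, kd: float) -> float:
    """Fraction of ligand bound at equilibrium for 1:1 binding P + L <-> PL.

    All three arguments must be positive and share the same concentration
    unit (for example, molar).
    """
    if p_total <= 0 or l_total <= 0 or kd <= 0:
        raise ValueError("concentrations and Kd must be positive")
    b = p_total + l_total + kd
    # The discriminant is always positive for positive inputs.
    disc = b * b - 4.0 * p_total * l_total
    # Smaller quadratic root, written in a form that avoids subtracting
    # two nearly equal numbers when Kd or one concentration dominates.
    pl = 2.0 * p_total * l_total / (b + math.sqrt(disc))
    return pl / l_total


if __name__ == "__main__":
    # Hypothetical conditions: 1 uM protein, 1 nM per library member.
    p_total, l_total = 1e-6, 1e-9
    for kd in (1e-9, 1e-8, 1e-7, 1e-6, 1e-5):
        frac = recovery(p_total, l_total, kd)
        print(f"Kd = {kd:.0e} M -> recovery = {frac:.3f}")
```

When [L]total is much smaller than [P]total, the result approaches [P]total / ([P]total + Kd), which is consistent with the caption's first approximation that only compounds with Kd ≪ [P]total are recovered. Competition among several ligands for the same site is not modeled here.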

