Review
J Chem Inf Model. 2022 May 23;62(10):2316-2331.
doi: 10.1021/acs.jcim.2c00041. Epub 2022 May 10.

Machine Learning on DNA-Encoded Library Count Data Using an Uncertainty-Aware Probabilistic Loss Function


Katherine S Lim et al. J Chem Inf Model. 2022.

Abstract

DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find novel small molecules that bind a protein target. Applying QSAR modeling to DEL selection data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has been done recently by training binary classifiers to learn DEL enrichments of aggregated "disynthons" in order to accommodate the sparse and noisy nature of DEL data. However, a binary classification model cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules, using a custom negative-log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships. Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a DEL dataset of 108,528 compounds screened against carbonic anhydrase (CAIX), and a dataset of 5,655,000 compounds screened against soluble epoxide hydrolase (sEH) and SIRT2. Due to the treatment of uncertainty in the data through the negative-log-likelihood loss used during training, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying structure-activity trends and highly enriched pharmacophores in DEL data. 
Further, this approach to uncertainty-aware regression modeling is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.
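As a rough illustration of the Poisson enrichment ratio described above, the sketch below compares normalized counts between two sequencing conditions. The function name and the 3/8 offset are assumptions made for this example and are not stated in the abstract. The offset was chosen because, with it, the ratio reproduces the three calculated enrichments quoted in the Figure 4 caption (1.92, 2.41, and 29.89).

```python
def poisson_enrichment(k_target, k_control, n_target, n_control, offset=3.0 / 8.0):
    """Ratio of normalized count rates between two sequencing conditions.

    k_* are per-compound barcode counts, n_* are total barcode counts.
    offset is an ASSUMED pseudo-count (not given in this excerpt); with
    offset=0 this is the plain ratio of normalized counts.
    """
    rate_target = (k_target + offset) / n_target
    rate_control = (k_control + offset) / n_control
    return rate_target / rate_control


# Total barcode counts quoted in the Figure 4 caption (POI, beads only).
N_POI, N_BEADS = 638_831, 5_208_230

# (POI count, beads-only count) pairs quoted in the Figure 4 caption.
examples = {11676: (151, 644), 23814: (8, 28), 81804: (1, 0)}

enrichments = {
    cpd_id: round(poisson_enrichment(k_poi, k_beads, N_POI, N_BEADS), 2)
    for cpd_id, (k_poi, k_beads) in examples.items()
}
print(enrichments)  # {11676: 1.92, 23814: 2.41, 81804: 29.89}
```

Because the function only needs two counts and two totals, the same calculation could be applied to other pairs of sequencing conditions, as the abstract suggests.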


Figures

Figure 1:
Approaches to analyzing DNA-encoded library (DEL) selection data to infer structure-activity relationships. (a) DEL experimental workflow. (b) Classic way of getting hits: raw normalized counts. (c) McCloskey et al.’s approach: binary classification of “disynthons”. (d) Our approach: regression task on “trisynthons,” taking uncertainty into account. The trisynthon drawing style was adapted from McCloskey et al.
Figure 2:
(a) Molecule representations. (b) Model architectures (message passing network drawing style adapted from Wu et al.). (c) Training objectives / loss functions for theoretical count values. (d) How molecules in a dataset are divided for model training and evaluation (data splits). NLL: negative-log likelihood; MSE: mean-squared error.
Figure 3:
Comparison of model performance, as measured by negative log-likelihood (NLL) loss. OH: one-hot; FP: fingerprint; FFNN: feed-forward neural network; D-MPNN: directed message-passing neural network; KNN: k-nearest neighbors. The NLL test losses of the NLL-trained models (OH-FFNN, FP-FFNN, D-MPNN) are compared to those of the baseline point-prediction-trained models (OH-FFNN pt, FP-FFNN pt, D-MPNN pt), k-nearest-neighbors models (OH-KNN, FP-KNN), and random models (predict all ones, shuffle predictions), for various data splits (cf. Figure 2d) on the (a) DD1S CAIX, (b) triazine sEH, (c) triazine SIRT2 datasets. Error bars represent ± one standard deviation. OH-FFNN, FP-FFNN, OH-FFNN pt, FP-FFNN pt, random (predict all ones), and random (shuffle predictions) results are averaged over five trials for each dataset; D-MPNN and D-MPNN pt results are averaged over five trials for the DD1S CAIX dataset and over three trials for the triazine sEH and triazine SIRT2 datasets; OH-KNN and FP-KNN results are averaged over five trials for the DD1S CAIX dataset and are single trials on a random 10% of the test set for the triazine sEH and triazine SIRT2 datasets. The result of each trial is shown separately in the SI (Figure S11).
Figure 4:
(a) Scatter plot of predicted and calculated enrichments for the test-set compounds of a FP-FFNN on a random split (cf. Figure 2d) of the DD1S CAIX dataset. The green parity line is the identity function, for reference. (b) Histograms of calculated and predicted enrichments for the test-set compounds of a FP-FFNN on a random split (cf. Figure 2d) of the DD1S CAIX dataset. The horizontal axis cutoff of 10 in the histogram of calculated enrichments is arbitrary, for the sake of legibility. (c) Close-up of a compound (ID 11676) with high counts (POI, beads only: 151, 644) and low uncertainty. The predicted enrichment of 1.93 approximates the calculated enrichment of 1.92. (d) Close-up of a compound (ID 23814) with low counts (POI, beads only: 8, 28) and high uncertainty. The predicted enrichment of 3.91 is high relative to the calculated enrichment of 2.41. (e) Close-up of a compound (ID 81804) with low counts (POI, beads only: 1, 0) and high uncertainty. The predicted enrichment of 1.19 is low relative to the calculated enrichment of 29.89. The total barcode counts in this dataset are 638,831 and 5,208,230 for the POI and beads-only conditions, respectively. Error bars represent 95% confidence intervals for calculated enrichments; the horizontal axis values of the scatter plot datapoints are maximum-likelihood calculated enrichments (calculated using z = 0; Methods). Compound IDs (“cpd id”) are sequential based on building block cycle numbers.
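The contrast between panels (c) and (e) can be sketched numerically. The code below is a minimal illustration, not the authors' implementation. It assumes a square-root-transform z statistic for the ratio of two Poisson rates and treats z as standard normal to obtain a negative log-likelihood. Both the form of the statistic and the function names are assumptions for this example; the counts and totals are those quoted in the caption.

```python
import math

# Total barcode counts quoted in the caption (POI, beads only).
N_POI, N_BEADS = 638_831, 5_208_230


def z_score(k_poi, k_beads, enrichment, n_poi=N_POI, n_beads=N_BEADS):
    """ASSUMED z statistic for how far the counts are from a candidate enrichment.

    z = 0 when the candidate equals the maximum-likelihood enrichment.
    """
    rho = enrichment * n_poi / n_beads  # expected ratio of raw counts
    diff = math.sqrt(k_poi + 0.375) - math.sqrt(rho * (k_beads + 0.375))
    return 2.0 * diff / math.sqrt(1.0 + rho)


def nll(k_poi, k_beads, enrichment):
    """Negative log-density of z under a standard normal (assumed loss form)."""
    z = z_score(k_poi, k_beads, enrichment)
    return 0.5 * z * z + 0.5 * math.log(2.0 * math.pi)


# Panel (e): counts (1, 0), predicted 1.19 versus calculated 29.89.
z_low_counts = z_score(1, 0, 1.19)
# The same predicted value scored against the high-count compound of panel (c).
z_high_counts = z_score(151, 644, 1.19)

print(f"low counts:  z = {z_low_counts:.2f}, NLL = {nll(1, 0, 1.19):.2f}")
print(f"high counts: z = {z_high_counts:.2f}, NLL = {nll(151, 644, 1.19):.2f}")
```

Under these assumptions, the prediction of 1.19 for compound 81804 stays within |z| < 1.96 even though it is far from the calculated 29.89, whereas the same prediction scored against the counts of compound 11676 is penalized much more heavily. This is one way to see how a likelihood-based loss can discount low-count outliers.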
Figure 5:
Scatter plot of predicted and calculated enrichments for a random subset (20,000 compounds) of the test set of a FP-FFNN on a random split (cf. Figure 2d) of the (a) triazine sEH, (c) triazine SIRT2 dataset, and for all disynthons in the (b) triazine sEH, (d) triazine SIRT2 dataset. The green parity line is the identity function, for reference. Error bars represent 95% confidence intervals for calculated enrichments; the horizontal axis values of the datapoints are maximum-likelihood calculated enrichments (calculated using z = 0; Methods).
Figure 6:
(a) Workflow for the generation of atom-centered Gaussian visualizations. (b) Workflow for calculating fingerprint bit and substructure importance. (c) Top 5 substructures and (d) example visualizations for compounds in the DD1S CAIX dataset based on the predictions of a FP-FFNN trained on a random split (cf. Figure 2d) of the DD1S CAIX dataset. In the example visualizations, atoms contributing positively to enrichment are highlighted in green, and atoms contributing negatively to enrichment are highlighted in pink, with color intensity corresponding to the level of contribution to enrichment. “No” represents the DNA linker attachment point. Compound IDs (“cpd id”) are sequential based on building block cycle numbers.
Figure 7:
Atom-centered Gaussian visualizations for example compounds in the test set of a FP-FFNN trained on a random split (cf. Figure 2d) of the (a) triazine sEH, (b) triazine SIRT2 dataset. Atoms contributing positively to enrichment are highlighted in green, and atoms contributing negatively to enrichment are highlighted in pink, with color intensity corresponding to the level of contribution to enrichment. “No” represents the DNA linker attachment point. Compound IDs (“cpd id”) are sequential based on building block cycle numbers.
Figure 8:
UMAP projection for (a) a random sample of 600k compounds from PubChem, (b) DOS-DEL-1, and (c) a random sample of 10% of the compounds in the triazine library. The UMAP embedding was fit to all three sets of compounds (using a random 10% of the compounds in DOS-DEL-1) simultaneously (cf. Methods). The coordinates of each plot represent the two dimensions to which the molecular fingerprints were projected.


