Review

J Chem Inf Model. 2022 May 23;62(10):2316-2331.

doi: 10.1021/acs.jcim.2c00041. Epub 2022 May 10.

Machine Learning on DNA-Encoded Library Count Data Using an Uncertainty-Aware Probabilistic Loss Function

Katherine S Lim¹,², Andrew G Reidenbach³, Bruce K Hua³,⁴, Jeremy W Mason³,⁵, Christopher J Gerry³,⁴, Paul A Clemons³, Connor W Coley¹,³,⁶

Affiliations

¹ Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
² Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
³ Chemical Biology and Therapeutics Science Program, Broad Institute, 415 Main Street, Cambridge, Massachusetts 02142, United States.
⁴ Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, United States.
⁵ Novartis Institutes for BioMedical Research, Cambridge, Massachusetts 02139, United States.
⁶ Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.

PMID: 35535861
PMCID: PMC10830332
DOI: 10.1021/acs.jcim.2c00041

Katherine S Lim et al. J Chem Inf Model. 2022.

Abstract

DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find novel small molecules that bind a protein target. Applying QSAR modeling to DEL selection data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has been done recently by training binary classifiers to learn DEL enrichments of aggregated "disynthons" in order to accommodate the sparse and noisy nature of DEL data. However, a binary classification model cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules, using a custom negative-log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships. Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a DEL dataset of 108,528 compounds screened against carbonic anhydrase (CAIX), and a dataset of 5,655,000 compounds screened against soluble epoxide hydrolase (sEH) and SIRT2. Due to the treatment of uncertainty in the data through the negative-log-likelihood loss used during training, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying structure-activity trends and highly enriched pharmacophores in DEL data. Further, this approach to uncertainty-aware regression modeling is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.
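
The way a count-based likelihood down-weights low-confidence outliers can be illustrated with a minimal sketch. This is not the loss function used by the authors. It assumes only that the counts in the two conditions are independent Poisson draws, so that the target count, conditioned on the total count, is binomial with a success probability set by the enrichment ratio and the two sequencing depths. The function names `binomial_ratio_nll` and `excess_nll` and their parameters are hypothetical; the counts, totals, and enrichment values are the ones quoted in the Figure 4 caption.

```python
import math

# Total barcode counts quoted in the Figure 4 caption (POI, beads only).
N_POI, N_CTRL = 638_831, 5_208_230


def binomial_ratio_nll(k_poi, k_ctrl, n_poi, n_ctrl, ratio):
    """Negative log-likelihood of k_poi given k_poi + k_ctrl.

    Assumption (not from the paper): both counts are Poisson, so the
    conditional distribution of k_poi is Binomial(k_poi + k_ctrl, p)
    with p = ratio * n_poi / (ratio * n_poi + n_ctrl).
    """
    p = ratio * n_poi / (ratio * n_poi + n_ctrl)
    log_coeff = (
        math.lgamma(k_poi + k_ctrl + 1)
        - math.lgamma(k_poi + 1)
        - math.lgamma(k_ctrl + 1)
    )
    return -(log_coeff + k_poi * math.log(p) + k_ctrl * math.log1p(-p))


def excess_nll(k_poi, k_ctrl, good_ratio, bad_ratio):
    """Extra loss paid for predicting bad_ratio instead of good_ratio."""
    bad = binomial_ratio_nll(k_poi, k_ctrl, N_POI, N_CTRL, bad_ratio)
    good = binomial_ratio_nll(k_poi, k_ctrl, N_POI, N_CTRL, good_ratio)
    return bad - good


# High-count compound (151, 644): predicting 1.19 instead of about 1.92.
high_count_penalty = excess_nll(151, 644, good_ratio=1.92, bad_ratio=1.19)
# Low-count compound (1, 0): predicting 1.19 instead of about 29.89.
low_count_penalty = excess_nll(1, 0, good_ratio=29.89, bad_ratio=1.19)

print(f"high-count penalty: {high_count_penalty:.2f} nats")
print(f"low-count penalty:  {low_count_penalty:.2f} nats")
```

Under these assumptions, a prediction of 1.19 costs far more on the well-sampled compound than on the compound observed once, even though the nominal enrichment of the latter is much further from the prediction. That asymmetry is what allows a likelihood-trained model to disregard sparse outliers.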

Figures

**Figure 1:**
Approaches to analyzing DNA-encoded library (DEL) selection data to infer structure-activity relationships. **(a)** DEL experimental workflow. **(b)** Classic way of getting hits: raw normalized counts. **(c)** McCloskey et al.’s approach: binary classification of “disynthons”. **(d)** Our approach: regression task on “trisynthons,” taking uncertainty into account. The trisynthon drawing style was adapted from McCloskey et al.

**Figure 2:**
**(a)** Molecule representations. **(b)** Model architectures (message passing network drawing style adapted from Wu et al.). **(c)** Training objectives / loss functions for theoretical count values. **(d)** How molecules in a dataset are divided for model training and evaluation (data splits). NLL: negative-log likelihood; MSE: mean-squared error.

**Figure 3:**
Comparison of model performance, as measured by negative log-likelihood (NLL) loss. OH: one-hot; FP: fingerprint; FFNN: feed-forward neural network; D-MPNN: directed message-passing neural network; KNN: k-nearest neighbors. The NLL test losses of the NLL-trained models (OH-FFNN, FP-FFNN, D-MPNN) are compared to those of the baseline point-prediction-trained models (OH-FFNN pt, FP-FFNN pt, D-MPNN pt), k-nearest-neighbors models (OH-KNN, FP-KNN), and random models (predict all ones, shuffle predictions), for various data splits (*cf.* Figure 2d) on the **(a)** DD1S CAIX, **(b)** triazine sEH, **(c)** triazine SIRT2 datasets. Error bars represent ± one standard deviation. OH-FFNN, FP-FFNN, OH-FFNN pt, FP-FFNN pt, random (predict all ones), and random (shuffle predictions) results are averaged over five trials for each dataset; D-MPNN and D-MPNN pt results are averaged over five trials for the DD1S CAIX dataset and over three trials for the triazine sEH and triazine SIRT2 datasets; OH-KNN and FP-KNN results are averaged over five trials for the DD1S CAIX dataset and are single trials on a random 10% of the test set for the triazine sEH and triazine SIRT2 datasets. The result of each trial is shown separately in the SI (Figure S11).

**Figure 4:**
**(a)** Scatter plot of predicted and calculated enrichments for the test-set compounds of a FP-FFNN on a random split (*cf.* Figure 2d) of the DD1S CAIX dataset. The green parity line is the identity function, for reference. **(b)** Histograms of calculated and predicted enrichments for the test-set compounds of a FP-FFNN on a random split (*cf.* Figure 2d) of the DD1S CAIX dataset. The horizontal axis cutoff of 10 in the histogram of calculated enrichments is arbitrary, for the sake of legibility. **(c)** Close-up of a compound (ID 11676) with high counts (POI, beads only: 151, 644) and low uncertainty. The predicted enrichment of 1.93 approximates the calculated enrichment of 1.92. **(d)** Close-up of a compound (ID 23814) with low counts (POI, beads only: 8, 28) and high uncertainty. The predicted enrichment of 3.91 is high relative to the calculated enrichment of 2.41. **(e)** Close-up of a compound (ID 81804) with low counts (POI, beads only: 1, 0) and high uncertainty. The predicted enrichment of 1.19 is low relative to the calculated enrichment of 29.89. The total barcode counts in this dataset are 638,831 and 5,208,230 for the POI and beads-only conditions, respectively. Error bars represent 95% confidence intervals for calculated enrichments; the horizontal axis values of the scatter plot datapoints are maximum-likelihood calculated enrichments (calculated using z = 0; Methods). Compound IDs (“cpd id”) are sequential based on building block cycle numbers.
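
The calculated enrichments in panels (c) to (e) can be checked against the quoted counts with a short sketch. The caption does not state the formula, so the expression below is an assumption: a depth-normalized count ratio with a 3/8 offset added to each count, as in square-root variance-stabilizing transforms of Poisson data. It was chosen because it reproduces all three quoted values to two decimals. The function name `point_enrichment` is hypothetical.

```python
# Total barcode counts from the caption (POI, beads only).
N_POI, N_CTRL = 638_831, 5_208_230


def point_enrichment(k_poi, k_ctrl, n_poi=N_POI, n_ctrl=N_CTRL, offset=3 / 8):
    """Depth-normalized enrichment ratio at z = 0.

    Assumption: the 3/8 offset is inferred from the caption's numbers,
    not taken from the paper's Methods.
    """
    return (n_ctrl / n_poi) * (k_poi + offset) / (k_ctrl + offset)


# (compound ID, POI count, beads-only count, enrichment quoted in the caption)
examples = [
    (11676, 151, 644, 1.92),
    (23814, 8, 28, 2.41),
    (81804, 1, 0, 29.89),
]

for cpd_id, k_poi, k_ctrl, quoted in examples:
    computed = point_enrichment(k_poi, k_ctrl)
    print(f"cpd {cpd_id}: computed {computed:.2f}, caption {quoted:.2f}")
```

The offset keeps the ratio finite when the beads-only count is zero, which is why a compound seen once in the POI condition and never in the control still receives a numeric (if highly uncertain) enrichment.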

**Figure 5:**
Scatter plot of predicted and calculated enrichments for a random subset (20,000 compounds) of the test set of a FP-FFNN on a random split (*cf.* Figure 2d) of the **(a)** triazine sEH, **(c)** triazine SIRT2 dataset, and for all disynthons in the **(b)** triazine sEH, **(d)** triazine SIRT2 dataset. The green parity line is the identity function, for reference. Error bars represent 95% confidence intervals for calculated enrichments; the horizontal axis values of the datapoints are maximum-likelihood calculated enrichments (calculated using z = 0; Methods).

**Figure 6:**
**(a)** Workflow for the generation of atom-centered Gaussian visualizations. **(b)** Workflow for calculating fingerprint bit and substructure importance. **(c)** Top 5 substructures and **(d)** example visualizations for compounds in the DD1S CAIX dataset based on the predictions of a FP-FFNN trained on a random split (*cf.* Figure 2d) of the DD1S CAIX dataset. In the example visualizations, atoms contributing positively to enrichment are highlighted in green, and atoms contributing negatively to enrichment are highlighted in pink, with color intensity corresponding to the level of contribution to enrichment. “No” represents the DNA linker attachment point. Compound IDs (“cpd id”) are sequential based on building block cycle numbers.

**Figure 7:**
Atom-centered Gaussian visualizations for example compounds in the test set of a FP-FFNN trained on a random split (*cf.* Figure 2d) of the **(a)** triazine sEH, **(b)** triazine SIRT2 dataset. Atoms contributing positively to enrichment are highlighted in green, and atoms contributing negatively to enrichment are highlighted in pink, with color intensity corresponding to the level of contribution to enrichment. “No” represents the DNA linker attachment point. Compound IDs (“cpd id”) are sequential based on building block cycle numbers.

**Figure 8:**
UMAP projection for **(a)** a random sample of 600k compounds from PubChem, **(b)** DOS-DEL-1, and **(c)** a random sample of 10% of the compounds in the triazine library. The UMAP embedding was fit to all three sets of compounds (using a random 10% of the compounds in DOS-DEL-1) simultaneously (*cf.* Methods). The coordinates of each plot represent the two dimensions to which the molecular fingerprints were projected.

Cited by

Deep Learning Approach for the Discovery of Tumor-Targeting Small Organic Ligands from DNA-Encoded Chemical Libraries.
Torng W, Biancofiore I, Oehler S, Xu J, Xu J, Watson I, Masina B, Prati L, Favalli N, Bassi G, Neri D, Cazzamalli S, Feng JA. ACS Omega. 2023 Jul 6;8(28):25090-25100. doi: 10.1021/acsomega.3c01775. eCollection 2023 Jul 18. PMID: 37483198 Free PMC article.
DNA-encoded library-enabled discovery of proximity-inducing small molecules.
Mason JW, Chow YT, Hudson L, Tutter A, Michaud G, Westphal MV, Shu W, Ma X, Tan ZY, Coley CW, Clemons PA, Bonazzi S, Berst F, Briner K, Liu S, Zécri FJ, Schreiber SL. Nat Chem Biol. 2024 Feb;20(2):170-179. doi: 10.1038/s41589-023-01458-4. Epub 2023 Nov 2. PMID: 37919549 Free PMC article.
Translating the Genome into Drugs.
Dixit A, Barhoosh H, Paegel BM. Acc Chem Res. 2023 Feb 21;56(4):489-499. doi: 10.1021/acs.accounts.2c00791. Epub 2023 Feb 9. PMID: 36757774 Free PMC article.
Rational Screening for Cooperativity in Small-Molecule Inducers of Protein-Protein Associations.
Liu S, Tong B, Mason JW, Ostrem JM, Tutter A, Hua BK, Tang SA, Bonazzi S, Briner K, Berst F, Zécri FJ, Schreiber SL. J Am Chem Soc. 2023 Oct 25;145(42):23281-23291. doi: 10.1021/jacs.3c08307. Epub 2023 Oct 10. PMID: 37816014 Free PMC article.
Machine learning in preclinical drug discovery.
Catacutan DB, Alexander J, Arnold A, Stokes JM. Nat Chem Biol. 2024 Aug;20(8):960-973. doi: 10.1038/s41589-024-01679-1. Epub 2024 Jul 19. PMID: 39030362 Review.
