J Cheminform. 2019 Nov 21;11(1):71.
doi: 10.1186/s13321-019-0393-0.

Randomized SMILES strings improve the quality of molecular generative models

Josep Arús-Pous et al. J Cheminform.

Abstract

Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models that use LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately. Specifically, a model trained with randomized SMILES was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models having a better representation of the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared to one trained with canonical SMILES.

Keywords: Chemical databases; Deep learning; Generative models; Randomized SMILES; Recurrent Neural Networks; SMILES.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Traversal of the molecular graph of aspirin using three methods: (a) the canonical ordering of the molecule; (b) atom order randomization without RDKit restrictions; (c) atom order randomization with RDKit restrictions, using the same atom ordering as (b). Atom ordering is specified with a number ranking from 1 to 13 for each atom, and the arrows show the molecular graph traversal process. Notice that the atom ordering is altered in (c), prioritizing the sidechains (red arrows) when traversing a ring and preventing SMILES substrings like c1cc(c(cc1))
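
The idea of atom order randomization can be illustrated with a small sketch. It is not the RDKit procedure described in the caption: it assumes a hypothetical toy molecule (acetic acid, heavy atoms only), handles acyclic molecules only (no ring closures, aromaticity or charges) and has no "restricted" mode. The names ATOMS, BONDS and random_smiles are invented for this illustration. It picks a random start atom and visits neighbours in random order, so one molecule yields several different but equivalent SMILES strings.

```python
import random

# Hypothetical toy molecule (an assumption for illustration): acetic acid.
ATOMS = ["C", "C", "O", "O"]
BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]  # (atom index, atom index, bond order)
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def build_adjacency(n_atoms, bonds):
    """Map each atom index to a list of (neighbour index, bond order)."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b, order in bonds:
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    return adjacency


def random_smiles(atoms, bonds, rng):
    """Write one SMILES string for an ACYCLIC molecule using a random
    start atom and a random neighbour order at every atom."""
    adjacency = build_adjacency(len(atoms), bonds)

    def write(atom, parent):
        children = [(n, o) for n, o in adjacency[atom] if n != parent]
        rng.shuffle(children)
        text = atoms[atom]
        for position, (child, order) in enumerate(children):
            piece = BOND_SYMBOL[order] + write(child, atom)
            is_last = position == len(children) - 1
            # Every neighbour except the last one is written as a branch.
            text += piece if is_last else "(" + piece + ")"
        return text

    return write(rng.randrange(len(atoms)), None)


rng = random.Random(0)
variants = {random_smiles(ATOMS, BONDS, rng) for _ in range(2000)}
print(len(variants), sorted(variants))
```

For real molecules a cheminformatics toolkit is the appropriate tool; RDKit, for example, exposes randomized output through options of its SMILES writer.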
Fig. 2
Architecture of the RNN model used in this study. For every step i, the input one-hot encoded token X_i goes through an embedding layer of size m ≤ w, followed by l > 0 GRU/LSTM layers of size w with dropout in-between and then a linear layer that has dimensionality w and the size of the vocabulary. Lastly a softmax is used to obtain the token probability distribution Y_ij. H_i symbolizes the input hidden state matrix at step i
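
Two ends of this pipeline can be sketched in plain Python: the embedding of a one-hot token and the final softmax. The recurrent GRU/LSTM layers and the linear layer are deliberately left out, and everything concrete here is an assumption made for illustration: the toy vocabulary, the embedding size, the random embedding weights and the made-up logits standing in for the linear layer's output.

```python
import math
import random

# Hypothetical toy vocabulary; "^" and "$" are assumed start and end tokens.
VOCAB = ["^", "$", "C", "O", "=", "(", ")"]


def one_hot(index, size):
    """One-hot encode a token index as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(one_hot_vector, embedding):
    """Embedding layer written as a vector-matrix product x @ E."""
    columns = range(len(embedding[0]))
    return [
        sum(x * row[j] for x, row in zip(one_hot_vector, embedding))
        for j in columns
    ]


def softmax(logits):
    """Turn logits into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
m = 3  # assumed embedding size
embedding = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in VOCAB]

# Multiplying a one-hot vector by the embedding matrix selects one row.
token_index = VOCAB.index("C")
embedded = embed(one_hot(token_index, len(VOCAB)), embedding)

# Made-up logits standing in for the output of the linear layer.
logits = [0.1, -1.0, 2.0, 1.0, 0.5, 0.0, -0.5]
probabilities = softmax(logits)
next_token = rng.choices(VOCAB, weights=probabilities, k=1)[0]
print(embedded == embedding[token_index], next_token)
```

Sampling the next token from the softmax output, instead of always taking the most probable token, is what lets a trained model generate many different strings.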
Fig. 3
Plot illustrating the percentage of GDB-13 sampled against the sample size for the ideal model (blue) and the best of the canonical (yellow), randomized restricted (green) and randomized unrestricted (orange) models. Notice that the ideal model is always an upper bound and eventually (n ≈ 21B) would sample the entire GDB-13. The trained models would reach the same point much later
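
The behaviour of an ideal model can be approximated with a short calculation, assuming it draws molecules uniformly and with replacement from a space of N molecules. The sketch below uses a rounded database size of 975 million as an assumption, and the function names are invented for this illustration. The expected fraction covered after n samples is 1 - (1 - 1/N)^n, and the expected number of samples needed to see every molecule follows the coupon-collector approximation N(ln N + γ) + 1/2, which for this N comes out at roughly 21 billion.

```python
import math

# Assumed, rounded size of GDB-13 (illustrative value, not an exact count).
GDB13_SIZE = 975_000_000
EULER_GAMMA = 0.5772156649015329


def expected_fraction_sampled(n_samples, space_size=GDB13_SIZE):
    """Expected fraction of the space seen after n uniform draws with
    replacement: 1 - (1 - 1/N)^n, evaluated in a numerically stable way."""
    return -math.expm1(n_samples * math.log1p(-1.0 / space_size))


def expected_samples_to_cover(space_size=GDB13_SIZE):
    """Coupon-collector approximation of the expected number of draws
    needed to see every molecule at least once: N (ln N + gamma) + 1/2."""
    return space_size * (math.log(space_size) + EULER_GAMMA) + 0.5


for n in (10**8, 10**9, 2 * 10**9, 10**10):
    print(f"{n:>14,d} samples -> {expected_fraction_sampled(n):.4f} of the space")
print(f"expected samples to cover everything: {expected_samples_to_cover():.3e}")
```

Any trained model deviates from uniform sampling, so its curve lies below this bound.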
Fig. 4
Histograms of different statistics from the randomized SMILES models. (a) Kernel Density Estimates (KDEs) of the number of randomized SMILES per molecule from a sample of 1 million molecules from GDB-13. The x axis is cut at 5000, but the unrestricted randomized variant has outliers up to 15,000. (b) KDEs of the negative log-likelihood (NLL) of each molecule (summing the probabilities of each of its randomized SMILES) for the same sample of 1 million molecules from GDB-13. The plot is cropped to the range [19, 25]. (c) Histograms of the NLL of all the restricted randomized SMILES of two molecules from GDB-13
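
Summing the probabilities of all randomized SMILES of a molecule and converting back to an NLL can be sketched as follows. The sketch assumes natural logarithms and that each per-SMILES NLL equals -ln p(SMILES); the function name molecule_nll is invented for this illustration. A log-sum-exp formulation avoids underflow, since the individual probabilities are very small.

```python
import math


def molecule_nll(smiles_nlls):
    """Combine the NLLs of all randomized SMILES of one molecule.

    Computes -ln(sum_i exp(-nll_i)) with the log-sum-exp trick, so that
    tiny probabilities do not underflow to zero.
    """
    if not smiles_nlls:
        raise ValueError("at least one SMILES NLL is required")
    lowest = min(smiles_nlls)
    total = sum(math.exp(lowest - nll) for nll in smiles_nlls)
    return lowest - math.log(total)


# Two SMILES with probability 0.25 each give a molecule probability of 0.5.
example = molecule_nll([math.log(4.0), math.log(4.0)])
print(example, math.log(2.0))
```

Because probabilities are added, the molecule NLL is never larger than the smallest NLL among its SMILES strings.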
Fig. 5
Linear regression plots between the UC-JSD and the UCC ratio. (a) Canonical SMILES, R² = 0.931. (b) Restricted randomized SMILES, R² = 0.856. (c) Unrestricted randomized SMILES, R² = 0.885
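
For readers unfamiliar with the statistic, the R² of a simple linear regression can be computed as sketched below. The data points are made up for illustration only and are not values from the study; the function name linear_fit is likewise invented here.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared), where
    r_squared = 1 - SS_residual / SS_total.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up points (NOT data from the paper), used only to exercise the code.
slope, intercept, r_squared = linear_fit([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 1.0, 2.0])
print(slope, intercept, r_squared)
```

An R² close to 1 means that one quantity is almost a linear function of the other.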
Fig. 6
Kernel Density Estimates (KDEs) of the molecule negative log-likelihoods (NLLs) of the ChEMBL models for the canonical SMILES variant (left) and the randomized SMILES variant (right). Each line symbolizes a different subset of 50,000 molecules from: training set (green), validation set (orange), randomized SMILES model (blue) and canonical SMILES model (yellow). Notice that the molecule NLLs for the randomized SMILES model (right) are obtained from the sum of all the probabilities of the randomized SMILES for each of the 50,000 molecules (adding up to 320 million randomized SMILES), whereas those for the canonical model are computed from the canonical SMILES of the 50,000 molecules
