J Cheminform. 2019 Mar 12;11(1):20.

doi: 10.1186/s13321-019-0341-z.

Exploring the GDB-13 chemical space using deep generative models

Josep Arús-Pous¹ ², Thomas Blaschke³ ⁴, Silas Ulander⁵, Jean-Louis Reymond⁶, Hongming Chen³, Ola Engkvist³

Affiliations

¹ Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden. josep.arus@dcb.unibe.ch.
² Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland. josep.arus@dcb.unibe.ch.
³ Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden.
⁴ Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19C, 53115, Bonn, Germany.
⁵ Medicinal Chemistry, Cardiovascular, Renal and Metabolism, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden.
⁶ Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland.

PMID: 30868314
PMCID: PMC6419837
DOI: 10.1186/s13321-019-0341-z

Josep Arús-Pous et al. J Cheminform. 2019.

Abstract

Recent applications of recurrent neural networks (RNNs) make it possible to train models that sample the chemical space. In this study we train RNNs on molecular string representations (SMILES) using a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1% of the database) reproduces 68.9% of the entire database when 2 billion molecules are sampled from it. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the "coupon collector problem" to compare the trained model against an upper bound, which lets us quantify how much it has learned. We also suggest that this method can be used to benchmark the learning capabilities of any molecular generative model architecture. Finally, an analysis of the generated chemical space shows that, mostly because of SMILES syntax, complex molecules with many rings and heteroatoms are more difficult to sample.
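The coupon collector comparison mentioned in the abstract can be illustrated with a short calculation. The sketch below assumes that the upper bound is an ideal sampler that draws uniformly, with replacement, from GDB-13 and never leaves it; the abstract does not spell this out, so treat it as an assumption. The function name `expected_coverage` is hypothetical, and the database size is the rounded "975 million" figure from the abstract.

```python
import math


def expected_coverage(n_items: int, n_samples: int) -> float:
    """Expected fraction of n_items distinct items seen at least once
    after n_samples uniform draws with replacement."""
    # Closed form: 1 - (1 - 1/n)^k. log1p/expm1 keep precision when n is huge.
    return -math.expm1(n_samples * math.log1p(-1.0 / n_items))


GDB13_SIZE = 975_000_000      # rounded size quoted in the abstract
SAMPLE_SIZE = 2_000_000_000   # number of sampled molecules quoted in the abstract
REPORTED = 0.689              # coverage reported in the abstract

ideal = expected_coverage(GDB13_SIZE, SAMPLE_SIZE)  # assumed uniform sampler
print(f"ideal uniform sampler covers {ideal:.1%} of the database")
print(f"reported model reaches {REPORTED / ideal:.1%} of that bound")
```

Under this assumption even a perfect model cannot reach 100% coverage with 2 billion samples, because uniform sampling with replacement keeps redrawing molecules it has already produced.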

Keywords: Chemical databases; Chemical space exploration; Deep generative models; Deep learning; Recurrent neural networks.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Euler diagram of the domain of an RNN trained with SMILES strings. The sets, ordered by size, are: all possible strings generated by an RNN (red), all possible valid SMILES (yellow), all possible SMILES of GDB-13 molecules (light blue), all canonical SMILES of GDB-13 molecules (dark blue) and the training set (black). Note that the relative sizes of the subsets in the diagram do not reflect their true sizes.

**Fig. 2**
Example of a forward pass of nicotine (CN1CCCC1c1cccnc1) on a trained model. The symbol sampled from the probability distribution at step $i$ (highlighted in black) is the input at step $i + 1$. This, together with the hidden state ($h_i$), gives the model time-dynamic behavior. Note that tokens with lower probability are sometimes sampled (as in step 1) because of the multinomial sampling of the model. Also note that the probability distributions shown are not from real trained models and that the vocabulary used throughout this publication is much bigger.
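The multinomial sampling step in the Fig. 2 caption, and the negative log-likelihood of a sampled string, can be sketched in a few lines. This is only an illustration: the three-token vocabulary and its probabilities are invented, no trained network is involved, and `sample_token` and `sequence_nll` are hypothetical helper names.

```python
import math
import random


def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token from a categorical (multinomial) distribution."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def sequence_nll(step_probs: list[float]) -> float:
    """Negative log-likelihood (natural log) of a sequence, given the
    probability the model assigned to each token that was actually chosen."""
    return -sum(math.log(p) for p in step_probs)


# Invented toy distribution for a single step; a real model would output
# a new distribution at every step, conditioned on its hidden state.
toy_step = {"C": 0.6, "N": 0.3, "O": 0.1}
rng = random.Random(0)
draws = [sample_token(toy_step, rng) for _ in range(1000)]

# Lower-probability tokens still appear, just less often.
print({t: draws.count(t) for t in toy_step})
print(sequence_nll([0.6, 0.3, 0.6]))
```

Because tokens are drawn rather than chosen by argmax, repeated sampling from the same model yields different strings, which is what lets a single model cover a large part of a database.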

**Fig. 3**
Metrics used to evaluate the training process. The red line at epoch 70 marks the epoch chosen for further tests. The negative log-likelihood (NLL) is calculated with natural logarithms. a 10 NLL plots of the training, validation and sampled sets every 25 epochs (from 1 to 200) and at the chosen epoch (70). b JSD plot between the three NLL distributions from panel a for each of the 200 epochs. c Percentage of valid molecules in each epoch. Notice that the plot already starts at around 96.5%. Mean (d) and variance (e) of the three distributions from panel a. Note that the spikes around epochs 1–20 are statistical fluctuations common at the beginning of the training process of an RNN, when the learning rate is high.
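The JSD between three NLL distributions in panel b can be sketched as follows. The sketch assumes the equal-weight generalization of the Jensen-Shannon divergence (entropy of the mixture minus the mean entropy of the components) and assumes each NLL distribution has first been binned into a histogram over shared bins; neither detail is stated in the caption. The function names and the binning helper are hypothetical.

```python
import math


def normalized_histogram(values, lo, hi, n_bins):
    """Bin values into n_bins equal-width bins on [lo, hi] and normalize to sum 1.
    Values outside the range are clamped into the first or last bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, n_bins - 1))] += 1
    total = sum(counts)
    return [c / total for c in counts]


def entropy(p):
    """Shannon entropy in nats; empty bins contribute zero."""
    return -sum(x * math.log(x) for x in p if x > 0)


def jsd(*dists):
    """Equal-weight Jensen-Shannon divergence of two or more discrete distributions."""
    n = len(dists)
    mixture = [sum(column) / n for column in zip(*dists)]
    return entropy(mixture) - sum(entropy(d) for d in dists) / n


# Invented NLL values standing in for training, validation and sampled sets.
train = normalized_histogram([20.1, 21.3, 22.0, 23.5], 15, 30, 5)
valid = normalized_histogram([20.4, 21.0, 22.8, 23.1], 15, 30, 5)
sampled = normalized_histogram([19.8, 21.9, 22.2, 24.0], 15, 30, 5)
print(jsd(train, valid, sampled))
```

A value near zero means the three distributions are nearly indistinguishable; with three inputs the divergence is bounded above by ln 3.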

**Fig. 4**
Results from sampling 2 billion SMILES from the 1 M model every five epochs (from 1 to 195). The red line at epoch 70 represents the chosen epoch for further tests. a Percent of the total sample (2B) that are valid SMILES, canonical SMILES, in GDB-13 and out of GDB-13. Solid lines represent all SMILES sampled, including repeats, whereas dotted lines represent only the unique molecules obtained from the whole count. b Close-up percentage of GDB-13 obtained every five epochs. Notice that the plot starts at around 54% and that the drop around epoch 10 correlates with the training fluctuations already mentioned in Fig. 3

**Fig. 5**
a Histograms of the frequency of the RNN models (orange) and the theoretical (binomial) frequency distribution of the ideal model (blue). b Histograms of the average NLL per molecule (from the 25 models) for molecules with frequency 0, 5, 10, 15, 20 and 25 computed from a sample of 5 million molecules from GDB-13
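The theoretical binomial distribution mentioned in panel a can be sketched under a simple assumption: each of the 25 models independently generates a given molecule with the same probability `p`. The value of `p` used below is a placeholder chosen for illustration, not a figure taken from this record, and `binomial_pmf` is a hypothetical helper name.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability that exactly k of n independent trials succeed."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


N_MODELS = 25   # number of models, from the caption
P_HIT = 0.87    # assumed per-model probability of generating a molecule

# Probability that a molecule is generated by exactly k of the 25 models.
pmf = [binomial_pmf(k, N_MODELS, P_HIT) for k in range(N_MODELS + 1)]
for k in (0, 5, 10, 15, 20, 25):
    print(f"frequency {k:2d}: {pmf[k]:.3e}")
```

Comparing an observed frequency histogram against such a curve shows which molecules are generated less often than a uniform sampler would generate them.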

**Fig. 6**
a–f MQN PCA plots (explained variance: $\mathrm{PCA}_1 = 51.3\%$, $\mathrm{PCA}_2 = 12.2\%$) calculated from a 130 million stratified sample of GDB-13 with 5 million molecules from each frequency value (0–25), colored by different descriptors. In all plots each pixel represents a group of similar molecules and its color represents the average value of a given descriptor. The colors rank from minimum to maximum: dark blue, cyan, green, yellow, orange, red and magenta. Each plot has the numeric range (min–max) between brackets after its title. Plots are colored by: a Number of trained models that generate each molecule. b Occupancy of every pixel. c Number of cyclic bonds. d Number of carbon atoms

**Fig. 7**
Plots of the frequency (left y axis) and the percent in database (right y axis) of 1-grams and 2-grams in the canonical SMILES of all GDB-13 molecules. The plots are sorted by the percentage present in the database. a Plot of the 1-grams (tokens), with the mean frequency in blue and the percent of 1-grams in the database in orange. The numeric tokens are highlighted in red. b Plot of the 2-gram mean frequency (blue) and percent (dashed orange). Because the number of 2-grams is too large (287), the x axis has been intentionally left blank and the mean frequency has been smoothed by an averaging window of size 8.

**Fig. 8**
Distribution of a sample of 3 million molecules drawn from all the molecules outside GDB-13 sampled by the RNN model. a Histogram of the GDB-13 constraints broken by each molecule. Notice that a molecule can break more than one constraint. b Distribution of the number of GDB-13 constraints broken by each molecule.

Cited by

Identification of nanomolar adenosine A2A receptor ligands using reinforcement learning and structure-based drug design.
Thomas M, Matricon PG, Gillespie RJ, Napiórkowska M, Neale H, Mason JS, Brown J, Harwood K, Fieldhouse C, Swain NA, Geng T, O'Boyle NM, Deflorian F, Bender A, de Graaf C. Nat Commun. 2025 Jul 1;16(1):5485. doi: 10.1038/s41467-025-60629-0. PMID: 40592852. Free PMC article.
Transfer Learning-Enhanced Prediction of Glass Transition Temperature in Bismaleimide-Based Polyimides.
Wang Z, Liu Y, Xu X, Zhang J, Li Z, Zheng L, Kang P. Polymers (Basel). 2025 Jun 30;17(13):1833. doi: 10.3390/polym17131833. PMID: 40647844. Free PMC article.
ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation.
Kyro GW, Morgunov A, Brent RI, Batista VS. ArXiv [Preprint]. 2023 Dec 4:arXiv:2309.05853v2. Update in: J Chem Inf Model. 2024 Feb 12;64(3):653-665. doi: 10.1021/acs.jcim.3c01456. PMID: 37744464. Free PMC article. Preprint.
Substructure-based neural machine translation for retrosynthetic prediction.
Ucak UV, Kang T, Ko J, Lee J. J Cheminform. 2021 Jan 11;13(1):4. doi: 10.1186/s13321-020-00482-z. PMID: 33431017. Free PMC article.
DeepGraphMolGen, a multi-objective, computational strategy for generating molecules with desirable properties: a graph convolution and reinforcement learning approach.
Khemchandani Y, O'Hagan S, Samanta S, Swainston N, Roberts TJ, Bollegala D, Kell DB. J Cheminform. 2020 Sep 4;12(1):53. doi: 10.1186/s13321-020-00454-3. PMID: 33431037. Free PMC article.


Grants and funding

676434/H2020 Marie Skłodowska-Curie Actions
