Chem Sci. 2019 Nov 18;11(2):577-586.
doi: 10.1039/c9sc04026a. eCollection 2020 Jan 14.

Constrained Bayesian optimization for automatic chemical design using variational autoencoders

Ryan-Rhys Griffiths et al. Chem Sci.

Abstract

Automatic Chemical Design is a framework for generating novel molecules with optimized properties. The original scheme, featuring Bayesian optimization over the latent space of a variational autoencoder, suffers from the pathology that it tends to produce invalid molecular structures. First, we demonstrate empirically that this pathology arises when the Bayesian optimization scheme queries latent space points far away from the data on which the variational autoencoder has been trained. Second, by reformulating the search procedure as a constrained Bayesian optimization problem, we show that the effects of this pathology can be mitigated, yielding marked improvements in the validity of the generated molecules. We posit that constrained Bayesian optimization is a good approach for solving this kind of training set mismatch in many generative tasks involving Bayesian optimization over the latent space of a variational autoencoder.
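The reformulation described in the abstract can be sketched as an acquisition function in which expected improvement on the property objective is weighted by the probability that a candidate latent point decodes to a valid molecule (the learned constraint). The sketch below is illustrative, not the paper's implementation; the posterior values, validity probabilities, and incumbent score are invented for the example.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(mu, sigma, best, p_valid):
    """Expected improvement on the objective, weighted by the probability
    that each candidate latent point decodes to a valid molecule."""
    z = (mu - best) / sigma
    ei = sigma * (z * norm.cdf(z) + norm.pdf(z))  # standard EI (maximization)
    return ei * p_valid  # feasibility-weighted acquisition

# Illustrative GP posterior over three candidate latent points.
mu = np.array([0.2, 0.9, 0.8])        # posterior mean of the property score
sigma = np.array([0.1, 0.1, 0.1])     # posterior standard deviation
p_valid = np.array([1.0, 0.05, 0.9])  # learned constraint: Pr(decodes validly)
best = 0.5                            # incumbent (best score observed so far)

acq = constrained_ei(mu, sigma, best, p_valid)
best_candidate = int(np.argmax(acq))
print(best_candidate)  # the high-mean but likely-invalid point is passed over
```

Note how the second candidate, despite having the highest posterior mean, loses to the third once its low validity probability is factored in; this is the mechanism by which the constrained search avoids the "dead zones" far from the training data.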


Figures

Fig. 1. The SMILES representation and one-hot encoding for benzene. For purposes of illustration, only the characters present in benzene are shown in the one-hot encoding. In practice there is a column for each character in the SMILES alphabet.
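The encoding in Fig. 1 can be reproduced in a few lines. As in the figure, only the characters occurring in benzene's SMILES string form the alphabet here; in practice the matrix would have one column for each character in the full SMILES alphabet.

```python
import numpy as np

smiles = "c1ccccc1"             # benzene in SMILES notation
alphabet = sorted(set(smiles))  # ['1', 'c']: benzene's characters only
char_to_idx = {c: i for i, c in enumerate(alphabet)}

# One row per character position, one column per alphabet character.
one_hot = np.zeros((len(smiles), len(alphabet)), dtype=int)
for pos, ch in enumerate(smiles):
    one_hot[pos, char_to_idx[ch]] = 1

print(one_hot.shape)  # (8, 2): 8 positions, 2 distinct characters
```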
Fig. 2. The SMILES variational autoencoder with the learned constraint function illustrated by a circular feasible region in the latent space.
Fig. 3. The dead zones in the latent space, adapted from ref. 21. The x and y axes are the principal components computed by PCA. The colour bar gives the log P value of the encoded latent points and the histograms show the coordinate-projected density of the latent points. One may observe that the encoded molecules are not distributed uniformly across the box constituting the bounds of the latent space.
Fig. 4. Experiments on 5 disjoint sets of 50 latent points each. Very small (VS) noise points are training-data latent points with approximately 1% noise added to their values; small (S) noise points have 10% noise added, and big (B) noise points have 50% noise added. Each latent point underwent 500 decode attempts and the results are averaged over the 50 points in each set. The percentage of decodings to: (a) valid molecules, (b) the methane molecule, (c) realistic molecules.
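The perturbation scheme behind Fig. 4 can be mimicked as follows. The latent dimensionality and the exact noise model (zero-mean Gaussian scaled to a fraction of each coordinate's magnitude) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 56))  # one set of 50 latent points (56-d latent space assumed)

def perturb(z, frac, rng):
    """Add zero-mean Gaussian noise scaled to `frac` of each value's magnitude."""
    return z + rng.normal(size=z.shape) * frac * np.abs(z)

z_vs = perturb(z, 0.01, rng)  # "very small" noise, ~1%
z_s = perturb(z, 0.10, rng)   # "small" noise, ~10%
z_b = perturb(z, 0.50, rng)   # "big" noise, ~50%
```

Each perturbed point would then be decoded 500 times, recording the fraction of decodings that yield valid, methane, or realistic molecules.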
Fig. 5. (a) The percentage of latent points decoded to realistic molecules. (b) The percentage of latent points decoded to unique, novel realistic molecules. The results are from 20 iterations of Bayesian optimization with batches of 50 data points collected at each iteration (1000 latent points decoded in total). The standard error is given for 5 separate train/test set splits of 90/10.
Fig. 6. The best scores for new molecules generated from the baseline model (blue) and the model with constrained Bayesian optimization (red). The vertical lines show the best scores averaged over 5 separate train/test splits of 90/10. For reference, the histograms are presented against the backdrop of the top 10% of the training data in the case of composite log P and QED, and the top 20% of the training data in the case of composite QED.
Fig. 7. The best molecule obtained by constrained Bayesian optimization as judged by the penalised log P objective function score.
Fig. 8. The best scores for novel molecules generated by the constrained Bayesian optimization model optimizing for PCE. The results are averaged over 3 separate runs with train/test splits of 90/10. The PCE score is normalized to zero mean and unit variance by the empirical mean and variance of the training set.

References

    1. Ryu S., Lim J., Hong S. H., Kim W. Y. Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network. arXiv preprint arXiv:1805.10988, 2018.
    2. Ryu J. Y., Kim H. U., Lee S. Y. Proc. Natl. Acad. Sci. U. S. A. 2018;115:E4304–E4311.
    3. Turcani L., Greenaway R. L., Jelfs K. E. Chem. Mater. 2018;31:714–727.
    4. Dey S., Luo H., Fokoue A., Hu J., Zhang P. BMC Bioinf. 2018;19:476.
    5. Coley C. W., Barzilay R., Green W. H., Jaakkola T. S., Jensen K. F. J. Chem. Inf. Model. 2017;57:1757–1772.