J Comput Aided Mol Des. 2017 Apr;31(4):379-391.
doi: 10.1007/s10822-016-0008-z. Epub 2017 Mar 9.

Bayesian molecular design with a chemical language model

Hisaki Ikebata et al. J Comput Aided Mol Des. 2017 Apr.

Abstract

The aim of computational molecular design is to identify promising hypothetical molecules with a predefined set of desired properties. We address the problem of accelerating materials discovery with state-of-the-art machine learning techniques. The method involves two types of prediction: forward and backward. The objective of the forward prediction is to create a set of machine learning models that predict various properties of a given molecule. Inverting the trained forward models through Bayes' law, we derive a posterior distribution for the backward prediction, conditioned on a desired property requirement. By exploring high-probability regions of the posterior with a sequential Monte Carlo technique, molecules that exhibit the desired properties can be created computationally. One major difficulty in the computational creation of molecules is excluding chemically unfavorable structures. To address this issue, we derive a chemical language model that acquires commonly occurring patterns of chemical fragments through natural language processing of ASCII strings of existing compounds written in the SMILES chemical language notation. In the backward prediction, the trained language model is used to refine chemical strings so that the properties of the resulting structures fall within the desired property region while chemically unfavorable structures are removed. The method is demonstrated through the design of small organic molecules with property requirements on the HOMO-LUMO gap and internal energy. The R package iqspr is available at the CRAN repository.
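The Bayesian inversion described in the abstract can be illustrated with a minimal sketch: the unnormalized log posterior of a SMILES string S is the log prior from a language model plus the log probability that the predicted property falls in the target region U. Everything in the block is an assumption made for illustration. The character-level n-gram with add-alpha smoothing is a stand-in for the chemical language model, the Gaussian predictive distribution and the `toy_forward_model` function are hypothetical, and the training strings are arbitrary examples rather than data from the paper.

```python
import math
from collections import Counter, defaultdict


def train_char_ngram(smiles_list, n=3):
    """Count character n-grams; '^' pads the start and '$' marks the end.

    Simplification: multi-character atoms such as 'Cl' are split into
    single characters.
    """
    counts = defaultdict(Counter)
    vocab = set()
    for s in smiles_list:
        padded = "^" * (n - 1) + s + "$"
        for i in range(n - 1, len(padded)):
            counts[padded[i - n + 1:i]][padded[i]] += 1
            vocab.add(padded[i])
    return counts, vocab


def log_prior(s, counts, vocab, n=3, alpha=1.0):
    """Add-alpha smoothed log p(S) under the character n-gram model."""
    padded = "^" * (n - 1) + s + "$"
    size = len(vocab) + 1  # one extra slot reserved for unseen characters
    total = 0.0
    for i in range(n - 1, len(padded)):
        ctx = counts.get(padded[i - n + 1:i], Counter())
        total += math.log((ctx[padded[i]] + alpha) / (sum(ctx.values()) + alpha * size))
    return total


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def log_prob_in_region(mu, sigma, lower, upper):
    """log P(lower <= Y <= upper) for Y ~ Normal(mu, sigma^2)."""
    p = normal_cdf((upper - mu) / sigma) - normal_cdf((lower - mu) / sigma)
    return math.log(max(p, 1e-300))  # floor avoids log(0)


def toy_forward_model(s):
    """Hypothetical forward model returning (mean, std) of one property.

    A real forward model would be a regression trained on measured or
    computed properties; this placeholder only depends on oxygen counts.
    """
    return 2.0 + 0.5 * s.count("O"), 1.0


def log_posterior(s, counts, vocab, lower, upper, n=3):
    """Unnormalized log posterior: log p(S) + log P(Y in U | S)."""
    mu, sigma = toy_forward_model(s)
    return log_prior(s, counts, vocab, n) + log_prob_in_region(mu, sigma, lower, upper)


training = ["c1ccccc1O", "CCO", "CC(=O)O", "c1ccccc1"]
counts, vocab = train_char_ngram(training, n=3)
for candidate in ["CCO", "c1ccccc1O", "O(1=c"]:
    print(candidate, round(log_posterior(candidate, counts, vocab, 2.0, 3.0), 3))
```

The sketch only scores fixed candidate strings. In the method described above, a sequential Monte Carlo sampler would propose and reweight modified strings so that the population concentrates in high-probability regions of this posterior.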

Keywords: Bayesian analysis; Inverse-QSPR; Molecular design; Natural language processing; SMILES; Small organic molecules.

Figures

Fig. 1
Outline of the Bayesian molecular design method
Fig. 2
Illustration of the substring selector ϕ_{n-1}(·) with three examples. In the contraction operation, a substring inside the outermost closed parentheses (green) is reduced to the character in its first position (red). The extraction operation then removes everything (black) other than the last n-1 (= 9) characters of the reduced string. The corresponding graphs are shown on the right, where the atoms in the boxes indicate the last characters in the inputs of ϕ_{n-1}(·) (left)
Fig. 3
(a) Perplexity scores (left) and valid grammar rate (1 − the syntax error rate) (right) with respect to 1000 SMILES strings generated from trained chemical language models. The conventional n-gram and the extended language models were trained with the BO and KN algorithms. The error bars represent the standard deviations across the 10 experiments corresponding to different training sets. (b) Examples of molecules generated from the trained chemical language model with n = 10 (top). The bottom row displays the most similar PubChem compounds that had the Tanimoto coefficient 0.9 on the PubChem fingerprint
Fig. 4
(a) Snapshots of structure alteration during the early phase of the inverse-QSPR calculation (t ∈ {10, 20, 50, 200}) with the desired property region set to U1, U2 or U3. The initial molecule (phenol) is shown at the top. The created molecules shown here were those ranked in the top four by the likelihood score at each t. Supplementary Movies 1–3 visualize the whole process of structure modification over t ∈ [1, 200]. (b) Property refinements resulting from the backward prediction at t ∈ {1, 20, 50, 200}. Results for the three property regions, U1, U2 and U3, are displayed together and color-coded red, green and blue, respectively. The shaded rectangles indicate the target regions. The dots indicate the HOMO-LUMO gaps and internal energies of the designed molecules as predicted by the QSPR models. For each Ui and t, the 10 non-redundant molecules with the greatest likelihoods are shown. (c) Properties of 50 molecules selected from the overall backward prediction process for U1 (red), U2 (green) and U3 (blue). The HOMO-LUMO gap and internal energy were calculated by the trained QSPR models (left) and by DFT calculation (right). The gray dots indicate the training data points. For each Ui, the 50 non-redundant molecules that achieved the highest likelihoods are shown. (d) Newly created molecules in the predefined property regions. The bottom row of each pair shows instances of significantly similar PubChem compounds that had the Tanimoto index 0.9
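The Tanimoto coefficient used in the captions for Figs. 3 and 4 to compare designed molecules with PubChem compounds can be sketched as follows, assuming each fingerprint is represented by the set of indices of its "on" bits. The function name, the handling of two empty fingerprints, and the bit indices in the example are assumptions for illustration; they are not real PubChem fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| of two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# Made-up on-bit indices for two hypothetical fingerprints.
designed = [3, 17, 42, 108, 256]
reference = [3, 17, 42, 108, 300]
print(tanimoto(designed, reference))
```

A value of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.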
