Bayesian molecular design with a chemical language model

Hisaki Ikebata¹, Kenta Hongo² ³ ⁴, Tetsu Isomura⁵, Ryo Maezono², Ryo Yoshida⁶ ⁷ ⁸

Affiliations

¹ The Graduate University for Advanced Studies (SOKENDAI), Tachikawa, Japan.
² Japan Advanced Institute of Science and Technology (JAIST), Nomi, Japan.
³ National Institute for Materials Science (NIMS), Tsukuba, Japan.
⁴ PRESTO, Japan Science and Technology Agency (JST), Kawaguchi, Japan.
⁵ The KAITEKI Institute, Inc., Tokyo, Japan.
⁶ The Graduate University for Advanced Studies (SOKENDAI), Tachikawa, Japan. yoshidar@ism.ac.jp.
⁷ National Institute for Materials Science (NIMS), Tsukuba, Japan. yoshidar@ism.ac.jp.
⁸ The Institute of Statistical Mathematics (ISM), Research Organization of Information and Systems, Tachikawa, Japan. yoshidar@ism.ac.jp.

PMID: 28281211
PMCID: PMC5393296
DOI: 10.1007/s10822-016-0008-z

J Comput Aided Mol Des. 2017 Apr;31(4):379-391. doi: 10.1007/s10822-016-0008-z. Epub 2017 Mar 9.

Abstract

The aim of computational molecular design is to identify promising hypothetical molecules that have a predefined set of desired properties. We address the problem of accelerating materials discovery with state-of-the-art machine learning techniques. The method involves two types of prediction: forward and backward. The objective of the forward prediction is to build a set of machine learning models that predict various properties of a given molecule. Inverting the trained forward models through Bayes' law, we derive a posterior distribution for the backward prediction, which is conditioned on a desired property requirement. By exploring high-probability regions of the posterior with a sequential Monte Carlo technique, molecules that exhibit the desired properties can be created computationally. One major difficulty in the computational creation of molecules is excluding chemically unfavorable structures. To address this issue, we derive a chemical language model that acquires commonly occurring patterns of chemical fragments through natural language processing of ASCII strings of existing compounds, which follow the SMILES chemical language notation. In the backward prediction, the trained language model is used to refine chemical strings such that the properties of the resulting structures fall within the desired property region while chemically unfavorable structures are removed. The method is demonstrated through the design of small organic molecules with property requirements on the HOMO-LUMO gap and internal energy. The R package iqspr is available at the CRAN repository.

Keywords: Bayesian analysis; Inverse-QSPR; Molecular design; Natural language processing; SMILES; Small organic molecules.
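
The Bayesian inversion described in the abstract can be illustrated with a small numerical sketch. In assumed notation (not taken from the text above), write $S$ for a candidate structure, $Y$ for its property and $U$ for the desired property region; Bayes' law then gives $p(S \mid Y \in U) \propto p(Y \in U \mid S)\, p(S)$, where the prior $p(S)$ would come from the chemical language model and the likelihood from the forward model. The Python below assumes a Gaussian forward prediction for a single property. The function names, priors, means and standard deviations are hypothetical and chosen only for illustration, and the sequential Monte Carlo search itself is not shown.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of Normal(mean, std**2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def region_likelihood(mean, std, lower, upper):
    """P(lower <= Y <= upper) when the forward model predicts Y ~ Normal(mean, std**2)."""
    return normal_cdf(upper, mean, std) - normal_cdf(lower, mean, std)


def posterior_weights(candidates, lower, upper):
    """Normalized weights proportional to prior * likelihood.

    Each candidate is a (prior, predicted_mean, predicted_std) tuple.
    """
    unnormalized = [
        prior * region_likelihood(mean, std, lower, upper)
        for prior, mean, std in candidates
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("all candidates have zero posterior mass")
    return [u / total for u in unnormalized]


# Hypothetical candidates: (language-model prior, predicted mean, predicted std).
candidates = [
    (0.02, 5.4, 0.3),  # prediction falls inside the target region
    (0.02, 7.5, 0.3),  # same prior, prediction far outside the region
    (0.001, 5.5, 0.3),  # good prediction, but an unlikely string under the prior
]
weights = posterior_weights(candidates, lower=5.0, upper=6.0)
for candidate, weight in zip(candidates, weights):
    print(candidate, round(weight, 4))
```

The third candidate shows the role of the prior: a structure can match the property target and still receive little posterior weight if the language model considers its string improbable.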

Figures

**Fig. 1**
Outline of the Bayesian molecular design method

**Fig. 2**
Illustration of the substring selector $ϕ_{n - 1} (\cdot)$ with three examples. In the contraction operation, the substring inside each outermost closed pair of parentheses (*green*) is reduced to the character in its first position (*red*). The extraction operation then keeps only the last $n - 1$ ($= 9$) characters of the reduced string and removes the rest (*black*). The corresponding graphs are shown on the *right*, where the atoms in the *boxes* indicate the last characters in the inputs of $ϕ_{n - 1} (\cdot)$ (*left*)
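
The two operations in this caption can be sketched in Python under one literal reading of the text. The assumptions are loud ones: every token is a single character, the parentheses of a contracted branch are kept around its first character, and an unclosed branch is left as it is. The function names `contract` and `substring_selector` are hypothetical, and the tokenization and parenthesis handling in the paper may differ.

```python
def contract(smiles_prefix):
    """Reduce the content of each outermost closed '(...)' group to its first character.

    Groups that are still open at the end of the string are not contracted,
    but closed groups nested inside them are.
    """
    out = []
    i = 0
    while i < len(smiles_prefix):
        ch = smiles_prefix[i]
        if ch != "(":
            out.append(ch)
            i += 1
            continue
        # Look for the ')' that closes the group opened at position i.
        depth = 0
        close = -1
        for j in range(i, len(smiles_prefix)):
            if smiles_prefix[j] == "(":
                depth += 1
            elif smiles_prefix[j] == ")":
                depth -= 1
                if depth == 0:
                    close = j
                    break
        if close == -1:
            # Unclosed group: keep the '(' and keep scanning its content.
            out.append("(")
            i += 1
        else:
            inner = smiles_prefix[i + 1:close]
            out.append("(" + inner[:1] + ")")
            i = close + 1
    return "".join(out)


def substring_selector(smiles_prefix, n):
    """Contract the prefix, then keep only its last n - 1 characters."""
    if n < 2:
        raise ValueError("n must be at least 2")
    reduced = contract(smiles_prefix)
    return reduced[-(n - 1):]


print(contract("CC(=O)N"))
print(substring_selector("CCCCCCCCC(C(C)C)(N", 10))
```

Under this reading, a long closed branch occupies only three characters of the context, so the atom that precedes the branch stays visible to the language model when it predicts the next character.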

**Fig. 3**
(a) Perplexity scores (*left*) and valid grammar rate (1 − the syntax error rate) (*right*) for 1000 SMILES strings generated from trained chemical language models. The conventional n-gram and the extended language models were trained with the BO and KN algorithms. The *error bars* represent the standard deviations across the 10 experiments corresponding to different training sets. (b) Examples of molecules generated from the trained chemical language model with $n = 10$ (*top*). The *bottom row* displays the most similar PubChem compounds, which had a Tanimoto coefficient $\geq 0.9$ on the PubChem fingerprint
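
The two scores named in this caption have standard textbook forms, sketched here in Python. Perplexity is taken as the exponential of the average negative log-probability per character, and the Tanimoto coefficient as the intersection over union of the on-bits of two binary fingerprints. The function names and toy inputs are hypothetical; the paper's exact normalization and the PubChem fingerprint computation are not reproduced here.

```python
import math


def perplexity(char_probs):
    """exp of the mean negative log-probability assigned to each character."""
    if not char_probs:
        raise ValueError("need at least one probability")
    mean_neg_log = -sum(math.log(p) for p in char_probs) / len(char_probs)
    return math.exp(mean_neg_log)


def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = a | b
    if not union:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# A model that assigns probability 1/4 to every character has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# Two toy fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Lower perplexity means the model finds the strings less surprising, while a Tanimoto coefficient near 1 means two fingerprints share almost all of their on-bits.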

**Fig. 4**
(a) Snapshots of structure alteration during the early phase of the inverse-QSPR calculation ($t \in \{10, 20, 50, 200\}$) with the desired property region set to $U_{1}$, $U_{2}$ or $U_{3}$. The initial molecule (phenol) is shown at the *top*. The created molecules shown here were those ranked in the top four by the likelihood score at each $t$. Supplementary Movies 1–3 visualize the whole process of structure modification over $t \in [1, 200]$. (b) Property refinements resulting from the backward prediction at $t \in \{1, 20, 50, 200\}$. Results for the three property regions, $U_{1}$, $U_{2}$ and $U_{3}$, are displayed together and color-coded *red*, *green* and *blue*, respectively. The *shaded rectangles* indicate the target regions. The *dots* indicate the HOMO-LUMO gaps and internal energies of the designed molecules as predicted by the QSPR models. For each $U_{i}$ and $t$, the 10 non-redundant molecules with the highest likelihoods are shown. (c) Properties of 50 molecules selected from the overall backward prediction process for $U_{1}$ (*red*), $U_{2}$ (*green*), and $U_{3}$ (*blue*). The HOMO-LUMO gap and internal energy were calculated by the trained QSPR models (*left*) and by DFT calculation (*right*). The *gray dots* indicate the training data points. For each $U_{i}$, the 50 non-redundant molecules that achieved the highest likelihoods are shown. (d) Newly created molecules in the predefined property regions. The *bottom row* of each pair shows instances of highly similar PubChem compounds, which had a Tanimoto index $\geq 0.9$
