Chem Sci. 2020 Sep 14;11(40):10959-10972.
doi: 10.1039/d0sc04184j.

Towards efficient discovery of green synthetic pathways with Monte Carlo tree search and reinforcement learning

Xiaoxue Wang et al. Chem Sci.

Abstract

Computer-aided planning of synthetic pathways with green process conditions has become increasingly important in organic chemistry, but the large search space inherent in synthesis planning and the difficulty of predicting reaction conditions make it a significant challenge. We introduce a new Monte Carlo Tree Search (MCTS) variant that promotes balance between exploration and exploitation across the synthesis space. Together with a value network trained by reinforcement learning and a solvent-prediction neural network, our algorithm is comparable to the best MCTS variant (PUCT, similar to Google's AlphaGo) in finding valid synthesis pathways within a fixed search time, and superior in identifying shorter routes with greener solvents under the same search conditions. In addition, with the same root compound visit count, our algorithm outperforms the PUCT MCTS by 16% in terms of determining successful routes. Overall, the success rate is improved by 19.7% compared to the upper confidence bound applied to trees (UCT) MCTS method. Moreover, we improve 71.4% of the routes proposed by the PUCT MCTS variant in pathway length and choice of green solvents. The approach generally enables the inclusion of Green Chemistry considerations in computer-aided synthesis planning, with potential applications in process development for fine chemicals or pharmaceuticals.

Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. The process of Monte Carlo Tree Search in synthesis planning. Following MDP notation, a molecule (or state) is denoted as s, and a template (or retrosynthetic disconnection action) is denoted as a. In the selection phase, starting from the target molecule, the most “promising” template is chosen recursively by selecting the template with the highest upper confidence bound (UCB(s,a)) value until a leaf node is reached. A policy network narrows the search beam in each template selection step. In the expansion phase, the leaf node is expanded by applying the selected template, generating new leaf nodes (precursors) that the tree expander has not visited before. In the evaluation step, a value network evaluates the new leaf nodes (if a node is buyable, its value is set to 1). In the backpropagation step, moving upward along the tree, the visit count N(s,a) of each compound-template pair (s,a), or edge, is updated. The Q(s,a) value (see Table 1) is also recalculated and used to recompute the UCB(s,a) values in the next selection step. With the updated values, the tree expander returns to the selection phase and again selects the most promising template for the target molecule (root node). Circles denote compounds: (blue) not commercially available; (green) commercially available.
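The selection and backpropagation steps described in Fig. 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Edge` class, the function names, and the use of the textbook UCT score are assumptions, and Q(s,a) is taken here to be the mean backed-up value because the paper's Table 1 definition is not reproduced on this page.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Statistics for one compound-template pair (s, a). Hypothetical helper."""
    visits: int = 0           # N(s, a)
    total_value: float = 0.0  # sum of values backed up through this edge

    @property
    def q(self) -> float:
        # Assumption: Q(s, a) is the mean backed-up value; 0.0 if unvisited.
        return self.total_value / self.visits if self.visits else 0.0


def ucb(edge: Edge, parent_visits: int, c: float) -> float:
    """Textbook UCT score; unvisited edges score infinity so they are tried first."""
    if edge.visits == 0:
        return math.inf
    return edge.q + c * math.sqrt(math.log(parent_visits) / edge.visits)


def select_template(edges: dict, c: float = 0.1) -> str:
    """Selection step: return the template with the highest UCB(s, a)."""
    parent_visits = sum(e.visits for e in edges.values()) or 1
    return max(edges, key=lambda a: ucb(edges[a], parent_visits, c))


def backpropagate(path: list, value: float) -> None:
    """Backpropagation step: update N(s, a) and Q(s, a) on every edge of the path."""
    for edge in path:
        edge.visits += 1
        edge.total_value += value
```

A leaf evaluated by the value network (or set to 1 if buyable) would supply `value`; the updated statistics then feed the next call to `select_template` from the root.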
Fig. 2. Bootstrapping and reinforcement learning (RL) processes used to train a value network. (a) Bottom-up propagation of the z(s) value for bootstrapping. If a route does not originate from buyable precursors, the z value of every non-buyable compound is zero (left). If the route originates from buyable precursors, then starting from the buyable leaf precursors (z = 1), the z value of a compound in the tree is assigned as the average z value of the compound's immediate precursors multiplied by a discount factor γ (0 < γ < 1). If another route under the same compound yields a higher z value than the current route, the z value of the compound is updated to the larger value. Circles denote compounds: blue circles are compounds that are not commercially available and green circles are buyable compounds. Triangles denote the templates a, through which compounds are transformed into their corresponding precursors. (b) The RL process to train the value network. With the z values sampled from MCTS in (a), a value network can be trained so that vθ(s) approximates z(s).
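The bottom-up z(s) propagation in Fig. 2(a) can be illustrated with a short recursive sketch. The function name, the dictionary layout of the tree, and the value γ = 0.9 are hypothetical. The sketch also assumes an acyclic tree and reads "route from buyable precursors" as requiring every precursor of a template to have z > 0; that reading is an interpretation of the caption, not a confirmed detail of the method.

```python
from statistics import mean


def bootstrap_z(compound, reactions, buyable, gamma=0.9):
    """Return z(compound) for an acyclic search tree.

    reactions: dict mapping a compound to a list of precursor lists,
               one precursor list per applied template.
    buyable:   set of commercially available compounds (z = 1).
    gamma:     discount factor, 0 < gamma < 1 (0.9 is an arbitrary choice).
    """
    if compound in buyable:
        return 1.0
    best = 0.0  # stays 0 when no route reaches buyable precursors
    for precursors in reactions.get(compound, []):
        zs = [bootstrap_z(p, reactions, buyable, gamma) for p in precursors]
        # Assumption: a template only counts if all its precursors are solved.
        if zs and all(z > 0 for z in zs):
            # Average over immediate precursors, discounted; keep the best route.
            best = max(best, gamma * mean(zs))
    return best
```

For example, with `reactions = {"T": [["A", "B"], ["C"]], "A": [["D"]]}` and `buyable = {"B", "C", "D"}`, the one-step route through C gives a higher z for T than the two-step route through A and B, so the shorter route determines z(T).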
Fig. 3. The success rate of finding buyable synthesis pathways by MCTS variants. For the modified UCT with dynamic c tuning and value network (mUCT-dc-V), the c value is initialized to 0.1. For all other UCT-type MCTS variants, c = 0.1. For both PUCT-type MCTS variants, c = 1. The value network used here is the Round 1 RL value network. (a) The performance of MCTS expansions for 30 s on test and training sets. The values of compounds in the buyable catalogue are set to 1 and override the values given by the value network. The success rates of the mUCT-dc-V and PUCT-V methods stand out from those of all other variants. (b) The success rates of MCTS expansions with a fixed root visit count of 5000 on 1000 compounds, the same test set as in (a). mUCT-dc-V significantly outperforms all other MCTS variants.
Fig. 4. Examples of chemical routes that the mUCT-dc-V method can solve within 30 s while PUCT-V cannot. The value of P(s,a) given by the policy network and the ranking of the template among the top 50 templates are given. The unique advantage of the mUCT-dc-V method is that the P(s,a) value is not used explicitly; therefore, even if the P(s,a) value is extremely small as a result of the imperfect policy network, the valid template will still be explored by the tree expander, which is not the case in the PUCT-V method. The value network here is the Round 1 RL value network. The policy network used by both MCTS variants is the same. The restrictions for both MCTS variants are the same: the top 50 templates given by the policy network are considered, the maximum depth is 10, and the minimum plausibility is 0.75 (see Methods). The affected functional groups in each step are marked in blue. The buyable compounds are framed in green.
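To see why a very small P(s,a) can starve a valid template under PUCT but not under a UCT-style rule, compare the exploration bonuses of the two textbook formulas. The expressions below are the commonly used forms (with 1 + N in the denominator to avoid division by zero) and are an assumption; the exact expressions used by PUCT-V and mUCT-dc-V may differ. The function names are hypothetical, and the c values follow those stated for Fig. 3.

```python
import math


def uct_bonus(parent_visits: int, visits: int, c: float = 0.1) -> float:
    """Textbook UCT exploration term; it does not depend on the policy prior."""
    return c * math.sqrt(math.log(parent_visits) / (1 + visits))


def puct_bonus(prior: float, parent_visits: int, visits: int, c: float = 1.0) -> float:
    """AlphaGo-style PUCT exploration term; it scales linearly with the prior P(s, a)."""
    return c * prior * math.sqrt(parent_visits) / (1 + visits)


# An unvisited template to which the policy network assigned a tiny prior:
tiny_prior = 1e-4
print(uct_bonus(parent_visits=5000, visits=0))
print(puct_bonus(tiny_prior, parent_visits=5000, visits=0))
```

With a prior of 1e-4 the PUCT bonus is far smaller than the UCT bonus at the same visit counts, so under PUCT the template is unlikely to be selected unless its Q value is already high.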
Fig. 5. Using MCTS to find short synthesis pathways using green solvents. (a) Assigning scores to solvents in the solvent database. (b) Using the output of the solvent prediction model to define the reaction solvent score. The suggested top three solvents are shown with their probabilities listed and solvent scores in parentheses. The reaction solvent score (RSS) is defined as the weighted average of the top three solvent scores. (c) Converting the reaction solvent score to a reaction solvent penalty (R solvent penalty), then defining the compound solvent score using the R solvent penalty. The compound solvent score (CSS) is defined as the maximal cumulated RSS in a valid pathway. The greenest route for a compound is the path that leads to the CSS. CSS is a function of the tree expander and the root compound. It is essential to optimize the tree expander so that CSS can be optimized.
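The scoring in Fig. 5(b) and (c) can be sketched as below. Several details are assumptions made for illustration: the weights are taken to be the predicted probabilities renormalized over the top three solvents, the mapping from RSS to penalty is left to a caller-supplied function because the caption does not specify it, and all function names and the numeric scale in the example are hypothetical.

```python
def reaction_solvent_score(predictions):
    """RSS: probability-weighted average of the top three solvent scores.

    predictions: list of (probability, solvent_score) pairs from a
                 solvent-prediction model.
    """
    top3 = sorted(predictions, key=lambda p: p[0], reverse=True)[:3]
    total = sum(prob for prob, _ in top3)
    if total == 0:
        raise ValueError("top-three probabilities sum to zero")
    # Assumption: weights are renormalized over the top three predictions.
    return sum(prob * score for prob, score in top3) / total


def compound_solvent_score(routes, penalty):
    """CSS: best (maximal) cumulated penalty over all valid routes.

    routes:  list of routes, each a list of per-reaction RSS values.
    penalty: callable mapping an RSS to a non-positive reaction penalty
             (placeholder; the actual conversion is not given here).
    """
    return max(sum(penalty(rss) for rss in route) for route in routes)


# Hypothetical example on an invented 0-4 score scale.
rss = reaction_solvent_score([(0.5, 4), (0.3, 2), (0.2, 1), (0.0, 10)])
css = compound_solvent_score([[4, 4], [2]], penalty=lambda r: r - 4)
```

Because penalties accumulate per reaction, longer routes tend to have a more negative cumulated score, which is how route length and solvent greenness end up in a single number.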
Fig. 6. The greenness of the synthetic routes (compound solvent score (CSS) of the root compound) generated by mUCT-dc-V compared with PUCT-V as the baseline method. Both algorithms use the Round 1 RL value network, and tree expansion is limited to 30 s. In 71.4% of the cases, mUCT-dc-V generates a higher root CSS than PUCT-V.
Fig. 7. Case study of the greenest routes generated by the PUCT-V and mUCT-dc-V algorithms. The root CSS is the compound solvent score (CSS) of the root compound, which reflects the overall greenness of the best route in the tree. Typically, the PUCT-V algorithm generates much longer synthetic routes, whose accumulated reaction penalties, or CSS, are much more negative than those of the routes generated by the mUCT-dc-V algorithm. Orange-framed compounds are the most probable solvents suggested by the solvent prediction network. Note that the reaction solvent score (RSS) is the weighted average of the top three suggested solvents; therefore, even if the top-1 solvents are the same for two reactions, their RSS values may differ, and so may their reaction penalties. Green-framed compounds are commercially available. The affected functional groups in each step are marked in blue.
Fig. 8. The minimum visit count N(s,a) required by the first switching as a function of Q/c value.
Fig. 9. The total visit count of each template before all templates are visited if c = current max Q/2. (a) Average visit counts, (b) maximal visit counts. The results are obtained from 10^4 random simulations.
Fig. 10. The dynamic method used to decide the value of c. We define c as half of the current maximum Q(s,b) value, updated during tree expansion as the visit count of the compound s increases.
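The dynamic rule in Fig. 10 amounts to recomputing c from the current Q values each time compound s is visited. A minimal sketch follows; the function name is hypothetical, and falling back to the initial value of 0.1 (the initialization stated for Fig. 3) while no template has a positive Q is an assumption about the start-up behaviour.

```python
def dynamic_c(q_values, initial_c=0.1):
    """Return c = max_b Q(s, b) / 2 for the templates b of compound s.

    q_values:  current Q(s, b) values of the templates under s.
    initial_c: value used until some template has a positive Q
               (assumed fallback, matching the stated initialization).
    """
    max_q = max(q_values, default=0.0)
    return max_q / 2 if max_q > 0 else initial_c


# c grows as better templates are discovered under the same compound.
print(dynamic_c([]))          # no statistics yet
print(dynamic_c([0.2, 0.8]))  # half of the current best Q
```

Tying c to the best Q seen so far keeps the exploration term on the same scale as the exploitation term, so c does not need to be hand-tuned per compound.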
Fig. 11. Architecture of the value network.
