J Chem Inf Model. 2011 Sep 26;51(9):2209-22.

doi: 10.1021/ci200207y. Epub 2011 Sep 2.

Learning to predict chemical reactions

Matthew A Kayala¹, Chloé-Agathe Azencott, Jonathan H Chen, Pierre Baldi

PMID: 21819139
PMCID: PMC3193800
DOI: 10.1021/ci200207y

Matthew A Kayala et al. J Chem Inf Model. 2011.

Affiliation

¹ Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, California, United States.

Abstract

Being able to predict the course of arbitrary chemical reactions is essential to the theory and applications of organic chemistry. Approaches to the reaction prediction problem can be organized around three poles corresponding to: (1) physical laws; (2) rule-based expert systems; and (3) inductive machine learning. Previous approaches at these poles, respectively, are not high throughput, are not generalizable or scalable, and lack sufficient data and structure to be implemented. We propose a new approach to reaction prediction utilizing elements from each pole. Using a physically inspired conceptualization, we describe single mechanistic reactions as interactions between coarse approximations of molecular orbitals (MOs) and use topological and physicochemical attributes as descriptors. Using an existing rule-based system (Reaction Explorer), we derive a restricted chemistry data set consisting of 1630 full multistep reactions with 2358 distinct starting materials and intermediates, associated with 2989 productive mechanistic steps and 6.14 million unproductive mechanistic steps. Drawing on machine learning, we pose the identification of productive mechanistic steps as a statistical ranking (information retrieval) problem: given a set of reactants and a description of conditions, learn a ranking model over potential filled-to-unfilled MO interactions such that the top-ranked mechanistic steps yield the major products. The machine learning implementation follows a two-stage approach, in which we first train atom-level reactivity filters to prune 94.00% of nonproductive reactions with a 0.01% error rate. Then, we train an ensemble of ranking models on pairs of interacting MOs to learn a relative productivity function over mechanistic steps in a given system. Without the use of explicit transformation patterns, the ensemble perfectly ranks the productive mechanism at the top 89.05% of the time, rising to 99.86% of the time when the top four are considered. Furthermore, the system is generalizable, making reasonable predictions over reactants and conditions which the rule-based expert system does not handle. A web interface to the machine-learning-based mechanistic reaction predictor is accessible through our chemoinformatics portal (http://cdb.ics.uci.edu) under the Toolkits section.

Figures

**Figure 1**
Example overall transformation and corresponding elementary, mechanistic reactions. (a) The overall transformation of an alkene with hydrobromic acid. This is a single graph rearrangement representation of a multi-step reaction. (b) The details of the two mechanistic reactions which compose the overall transformation. The first involves a proton transfer reaction, and the second involves the addition of the bromide anion. Each detailed mechanism is an example of an “arrow-pushing” diagram involving a single transition state, in which each arrow denotes the movement of a pair of electrons, and multiple arrows on a single diagram denote concerted movement.

**Figure 2**
Overall reaction prediction framework. (a) A user inputs the reactants and conditions. (b) We identify potential electron donors and acceptors using coarse approximations of electron-filled and electron-unfilled MOs. (c) Highly sensitive reactive site classifiers are trained and used to filter out the vast majority of unreactive sites, pruning the space of potential reactions. (d) Reactions are enumerated by pairing filled and unfilled MOs. (e) A ranking model is trained and used to order the reactions, where the top-ranked reaction or reactions represent the major products. The top-ranked product can be recursively chained to a new instance of the framework for multi-step reaction prediction.
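Steps (c) through (e) of this framework can be illustrated with a minimal Python sketch. It is not the authors' implementation: `predict_step`, `site_prob`, `pair_score`, `threshold`, and `top_k` are hypothetical names, and the site classifier and ranking model are passed in as plain functions rather than trained here.

```python
from itertools import product

def predict_step(filled, unfilled, site_prob, pair_score, threshold, top_k=1):
    """Filter candidate sites, enumerate filled/unfilled pairs, and rank them.

    site_prob(site)          -> estimated probability that a site is reactive (hypothetical)
    pair_score(filled, unf.) -> ranking score for one candidate reaction (hypothetical)
    """
    # (c) Keep only sites the classifier considers plausibly reactive.
    # A low threshold keeps the filter highly sensitive.
    kept_filled = [s for s in filled if site_prob(s) >= threshold]
    kept_unfilled = [s for s in unfilled if site_prob(s) >= threshold]

    # (d) Enumerate candidate reactions as filled/unfilled orbital pairs.
    candidates = list(product(kept_filled, kept_unfilled))

    # (e) Order candidates from highest to lowest score and return the best few.
    ranked = sorted(candidates, key=lambda pair: pair_score(*pair), reverse=True)
    return ranked[:top_k]
```

For multi-step prediction, the products of the top-ranked pair would be fed back into a new call, as the caption describes; that product-generation step is outside the scope of this sketch.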

**Figure 3**
Molecular orbital types in the augmented molecular graph for the core reaction model. Unfilled molecular orbital types are on the top and filled types are on the bottom.

**Figure 4**
The filled and unfilled orbitals yielded for C₂. Note the π bond adjacent to C₄ acts as either a filled or unfilled *chain* orbital.

**Figure 5**
Extended orbital chain interaction examples. (a) Enolate reacting as a lone pair, π-bond chain. Chaining is necessary to capture the implicit pre-reaction resonance rearrangement. (b) E2 elimination where the H-C σ-bond chains into the C-Br σ-bond. The central bond in each chain simultaneously acts as an electron source and sink at different points in the overall flow.

**Figure 6**
Topology-based count features. Diisopropylamide anion is pictured in a cartoon format with all the distinct path types starting at the nitrogen. Sub-trees (not shown) are rooted at the atom with out-degree at most 2.

**Figure 7**
Reactive site predictions using models trained with all the data. The histograms show the distribution of prediction values on the unreactive labeled data. The red points show the prediction values for individual reactive labeled data points jittered for clarity. (a) shows the filled site predictions, while (b) shows the same plot for the unfilled site predictions.

**Figure 8**
Shared weight artificial neural network architecture. Two shared weight artificial neural networks are connected to a sigmoidal output layer with fixed weights. The output of the final network will approach 1 if the input to the left network is scored greater than the input to the right network, and 0 otherwise. As the lower level networks share weights, they compute the same scoring function.
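The shared-weight idea in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hidden-layer size, the tanh activation, the random initialization, and the fixed output weights of +1 and -1 are choices made here, not details taken from the paper, and no training is shown. `make_scorer` and `pairwise_output` are hypothetical names.

```python
import math
import random

def make_scorer(n_features, n_hidden, seed=0):
    """Build one small feed-forward scoring network and return it as a function."""
    rng = random.Random(seed)
    # Randomly initialized weights (assumption: in practice these would be learned).
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_features)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def score(x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(w1, b1)]
        return sum(w * h for w, h in zip(w2, hidden))

    return score

def pairwise_output(score, left, right):
    """Sigmoid of the score difference.

    The same `score` function is applied to both inputs, which is what
    sharing weights means here; the output layer uses fixed weights
    +1 (left) and -1 (right), an assumption of this sketch.
    """
    return 1.0 / (1.0 + math.exp(-(score(left) - score(right))))
```

Because both inputs pass through the same scoring function, swapping them maps the output p to 1 - p, and at prediction time the single `score` function can rank any number of candidates on its own.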

**Figure 9**
Multi-Step Reaction Prediction. An example of a correctly predicted multi-step reaction from a careful validation experiment. All reactants shown were held out in a special testing set, while all other data in the Reaction Explorer system is used as a training set. Thus, the predictions shown are not seen in training. The products from the top ranked reaction are recursively input to a new instance of the overall pipeline to make a multi-step reaction predictor. The error rate is low enough to make the system usable for prediction of overall transformations.

**Figure 10**
Two correctly ranked ring-forming reactions in cross-validation experiments. These two reactions are labeled productive by Reaction Explorer with the “Mix Reactants, Polar Protic” reagent model. Without seeing these reactions during training, our approach inductively learns to correctly rank these two reactions as the most productive with the corresponding conditions. The system also correctly returns reasonable ring-forming reactions as the second highest ranked for both sets of reactants.

**Figure 11**
Reasonable reactions not returned by Reaction Explorer, but highly ranked by our system. The reaction conditions for both systems correspond to the standard conditions from the “Mix Reactants, Polar Protic” Reaction Explorer reagent model. (a) and (c) are the top ranked reactions over the 7-bromohept-1-en-2-olate and 8-bromooct-1-en-2-olate reactants respectively, while (b) and (d) are the second ranked reactions over their respective reactants. Neither set of reactants is included in the training set of productive reactions.

**Figure 12**
Within-1 ranked system with a reasonable mechanism in cross-validation experiments. The deprotonation is ranked slightly higher than the substitution, although the Reaction Explorer system labels the substitution as productive and does not predict the deprotonation. However, the previous step in this test case was the protonation of the alcohol. As this hydrogen transfer reaction is reversible, the deprotonation is kinetically favorable, just not productive.

**Figure 13**
Within-1 ranked system with a reasonable mechanism in cross-validation experiments. The top two ranked reactions with 2,7-dimethyloct-2-ene, an intermediate in a Reaction Explorer multi-step reaction. Reaction Explorer labels only the 6-member ring forming reaction as productive. Although this leaves a 2° carbocation, it is considered productive because of future methyl shifts in the underlying Reaction Explorer reaction sequence. We consistently rank the reaction yielding a 5-member ring and 3° carbocation higher in cross-validation experiments where 2,7-dimethyloct-2-ene is in the test set. Although this is considered an error, it is a reasonable one.

**Figure 14**
Within-n predicted reaction recovery for different reaction conditions over cross-validation experiments. The fraction of reactant systems in which all productive reactions are recovered is presented on the y-axis, and n is presented on the x-axis. Color and symbols are used to denote different reaction conditions. The number of queries with the given reaction conditions is presented in parentheses after the condition name. Details of the reaction conditions and how they map back to the Reaction Explorer reagent models are presented in Table S1.
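A recovery curve of this kind could be computed as in the following Python sketch. The precise definition of "within-n" is not given on this page; the sketch assumes it means that at most n unproductive reactions are ranked above the lowest-ranked productive one, which is consistent with the Within-1 example in Figure 12 but remains an assumption. The function names and the data layout (a list of scores plus a set of productive indices per query) are hypothetical.

```python
def within_n(scores, productive, n):
    """True if at most n unproductive candidates outscore the lowest-scored productive one.

    scores     -- list of model scores, one per candidate reaction
    productive -- non-empty set of indices of the productive candidates
    """
    lowest = min(scores[i] for i in productive)
    outranking = sum(1 for i, s in enumerate(scores)
                     if i not in productive and s > lowest)
    return outranking <= n

def recovery_curve(queries, max_n):
    """Fraction of queries that are within-n, for n = 0 .. max_n."""
    return [sum(within_n(s, p, n) for s, p in queries) / len(queries)
            for n in range(max_n + 1)]
```

Grouping the queries by reaction condition before calling `recovery_curve` would give one curve per condition, matching the layout the caption describes.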
