Nat Commun. 2022 Feb 18;13(1):964.
doi: 10.1038/s41467-022-28536-w.

Biocatalysed synthesis planning using data-driven learning

Daniel Probst et al. Nat Commun. 2022.

Abstract

Enzyme catalysts are an integral part of green chemistry strategies towards a more sustainable and resource-efficient chemical synthesis. However, the use of biocatalysed reactions in retrosynthetic planning clashes with the difficulties in predicting the enzymatic activity on unreported substrates and enzyme-specific stereo- and regioselectivity. As of now, only rule-based systems support retrosynthetic planning using biocatalysis, while initial data-driven approaches are limited to forward predictions. Here, we extend the data-driven forward reaction as well as retrosynthetic pathway prediction models based on the Molecular Transformer architecture to biocatalysis. The enzymatic knowledge is learned from an extensive data set of publicly available biochemical reactions with the aid of a new class token scheme based on the enzyme commission classification number, which captures catalysis patterns among different enzymes belonging to the same hierarchy. The forward reaction prediction model (top-1 accuracy of 49.6%), the retrosynthetic pathway (top-1 single-step round-trip accuracy of 39.6%) and the curated data set are made publicly available to facilitate the adoption of enzymatic catalysis in the design of greener chemistry processes.
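
As a rough illustration of the class token scheme mentioned in the abstract, the sketch below truncates an enzyme commission (EC) number to a chosen level and emits one token per retained level, so that enzymes in the same branch of the hierarchy share their leading tokens. The function names and the `[ecN:value]` token spelling are hypothetical and chosen only for illustration; the abstract does not state the exact token format the authors use. Returning no tokens at level 0 is likewise an assumption.

```python
def truncate_ec(ec_number: str, level: int) -> list[str]:
    """Keep the first `level` fields of a four-field EC number such as '3.5.1.4'."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    if not 0 <= level <= 4:
        raise ValueError("level must be between 0 and 4")
    return parts[:level]


def ec_tokens(ec_number: str, level: int) -> list[str]:
    """Emit one hypothetical token per retained EC level.

    The position is written into the token so that, for example, class 3 at
    level 1 and sub-subclass 3 at level 3 are never confused.
    """
    fields = truncate_ec(ec_number, level)
    return [f"[ec{i}:{value}]" for i, value in enumerate(fields, start=1)]


# Two hydrolases from the same EC-level 2 subclass share their first two tokens.
tokens_a = ec_tokens("3.5.1.4", 3)
tokens_b = ec_tokens("3.5.2.6", 3)
shared_prefix = tokens_a[:2] == tokens_b[:2]
```

With this layout, a coarser level simply drops trailing tokens, which is one way a model could pool examples across related enzymes.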

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Introducing enzymes as green catalysts to data-driven template-free chemical synthesis.
The molecular transformer was trained on chemical reactions extracted from the USPTO data set and the new ECREACT data set using multitask transfer learning.
Fig. 2. Enzyme class, substrate and product distributions of the data set ECREACT.
a The distribution of samples at EC-levels 1 (corresponding to enzyme classes) and 2 (corresponding to enzyme sub-classes) for oxidoreductases (class 1), transferases (class 2), hydrolases (class 3), lyases (class 4), isomerases (class 5), ligases (class 6) and translocases (class 7), in the ECREACT EC3 data set. A more extensive visualisation of the distribution of EC-levels 1, 2 and 3 can be found in Supplementary Fig. 3. TMAPs visualising the distribution of MAP4-encoded (b) reactants and (c) products in the ECREACT EC3 subset coloured by enzyme class corresponding to EC-level 1. Distributions of molecular distances (MAP4) per class are shown in Supplementary Fig. 1. While molecules of transferase- (class 2), lyase- (class 4), and, to a lesser extent, hydrolase-catalysed (class 3) reactions populate regions of the chemical space specific to each class (homogeneous), molecules from other classes are found in predominantly heterogeneous regions.
Fig. 3. Overall accuracies of models based on different ECREACT token schemes EC0, EC1, EC2, EC3 and EC4.
Accuracies are reported for a forward prediction, b backward prediction, c round-trip prediction (a forward prediction followed by a backward prediction) and d backward EC number only prediction. Top-n indicates the accuracy when checking the top n predictions for the correct one.
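
The top-n accuracy named in the Fig. 3 caption can be sketched as follows: a sample counts as correct if the target appears among the first n ranked predictions. This is a minimal sketch on plain strings, and the function name and toy data are hypothetical. A real evaluation would most likely compare canonicalised SMILES rather than raw strings, which is an assumption and is not done here.

```python
def top_n_accuracy(ranked_predictions: list[list[str]], targets: list[str], n: int) -> float:
    """Fraction of samples whose target is among the first n ranked predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one ranked prediction list per target")
    if n < 1:
        raise ValueError("n must be at least 1")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(
        target in ranked[:n]
        for ranked, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy example: three samples, two ranked candidates each.
predictions = [["A", "B"], ["C", "D"], ["E", "F"]]
truth = ["A", "D", "X"]
top1 = top_n_accuracy(predictions, truth, n=1)  # only the first sample is a hit
top2 = top_n_accuracy(predictions, truth, n=2)  # the second sample is now a hit too
```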
Fig. 4. Class-wise accuracy for the forward model trained on EC3.
a The top-k prediction accuracy for each class shows significant differences among classes, caused by the number of available samples per EC-level 3 category. b The accuracy of top-1 predictions per EC-level 3 category. Each dot represents an EC-level 3 subclass coloured by the number of test samples N. Large EC-level 3 subclasses (red) greatly influence the performance of predicting transferase-catalysed reaction (class 2) outcomes. Oxidoreductase-catalysed reactions (class 1) are distributed among many EC-level 3 subclasses, causing a lower performance compared to other classes with fewer samples overall. Detailed accuracies for top-2 and top-5 predictions can be found in Supplementary Fig. 4.
Fig. 5. Inspection of forward predictions labelled as incorrect.
For each reaction, the ground truth is shown in black while the prediction is shown in red. The reactions are catalysed by (1, 2) oxidoreductases acting on the CH-NH2 group of donors with oxygen as acceptor, (3) a zeatin 9-aminocarboxyethyltransferase, (4) a cyclic-CMP phosphodiesterase, (5) a chloromuconate cycloisomerase and (6) a pantothenate synthetase.
Fig. 6. Analysis of the attention weights in the forward prediction models on reaction (6) from Supplementary Fig. 8.
The attention mapping between tokens representing EC numbers is highlighted in purple (reactant atom tokens are connected using grey curves). The curve thickness is proportional to the attention weight computed by the forward Molecular Transformer.
Fig. 7. Class-wise accuracy for the backward model trained on EC3.
a The top-k prediction accuracies for each class (corresponding to EC-level 1) show significant differences among classes, caused by the number of available samples per EC-level 3 category. b The accuracy of top-1 predictions per EC-level 3 category. Each dot represents an EC-level 3 category coloured by the number of test samples N. Large EC-level 3 subclasses (red) greatly influence the performance of predicting transferase-catalysed reaction (class 2) outcomes. Oxidoreductase-catalysed reactions (class 1) are distributed among many EC-level 3 subclasses, causing a lower performance compared to other classes with fewer samples overall. Detailed accuracies for top-2 and top-5 predictions can be found in Supplementary Fig. 9.
Fig. 8. Inspection of backward predictions labelled as incorrect.
For each reaction, the ground truth is shown in black while the prediction is shown in red. The ground truth enzyme is marked with purple, the predicted enzyme with red. The model (1, 4) predicted different enzyme-catalysed reactions leading to the same product, (2) predicted a substrate with a different isomer, (3) corrected an erroneous data set entry and (5) was not able to predict an enzymatic reaction and fell back on a reaction learned from USPTO data.
Fig. 9. Distribution of rxnfp fingerprints for the reactions in the combined space of ECREACT (grey) and RetroBioCat test set reactions (blue), embedded with TMAP.
a The reactions from the RetroBioCat test set form distinct clusters in the combined reaction space. b For RetroBioCat test set reactions (blue), the fraction of nearest neighbours (k = 10) that come from the set itself is consistently higher than for reactions from ECREACT (grey).
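
The nearest-neighbour statistic in panel b of Fig. 9 can be approximated with a brute-force sketch: for each point, find its k nearest neighbours and report the fraction that carry the same set label. The vectors here are toy 2D points and the distance is Euclidean, both of which are assumptions for illustration; the caption refers to rxnfp fingerprints, and the authors' actual distance measure and neighbour search are not given in this text. The function name is hypothetical.

```python
import math


def same_set_neighbour_fraction(vectors, labels, k):
    """For each vector, return the fraction of its k nearest neighbours
    (excluding itself) that share its label. Brute force, O(n^2)."""
    if len(vectors) != len(labels):
        raise ValueError("need one label per vector")
    if not 0 < k < len(vectors):
        raise ValueError("k must be between 1 and len(vectors) - 1")
    fractions = []
    for i, v in enumerate(vectors):
        # Sort all other points by distance to v; ties break on index.
        ranked = sorted(
            (math.dist(v, w), j) for j, w in enumerate(vectors) if j != i
        )
        neighbours = [j for _, j in ranked[:k]]
        same = sum(labels[j] == labels[i] for j in neighbours)
        fractions.append(same / k)
    return fractions


# Two well-separated toy clusters standing in for the two reaction sets.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sets = ["ECREACT", "ECREACT", "ECREACT", "RetroBioCat", "RetroBioCat", "RetroBioCat"]
fractions = same_set_neighbour_fraction(points, sets, k=2)
```

A fraction close to 1 for one set indicates that its members mostly neighbour each other, which is what forming distinct clusters amounts to.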
Fig. 10. Enzyme-catalysed synthesis of synthetically useful compounds under mild conditions.
(1) Aminoalcohol, (2) Homoaspartate, (3) 4-hydroxy-L-glutamic acid, (4) β-ketoacid and (5) (S)-norlaudanosoline.
