Nat Commun. 2022 Feb 18;13(1):964.

doi: 10.1038/s41467-022-28536-w.

Biocatalysed synthesis planning using data-driven learning

Daniel Probst¹,², Matteo Manica³, Yves Gaetan Nana Teukam³, Alessandro Castrogiovanni³,⁴, Federico Paratore³, Teodoro Laino³,⁴

Affiliations

¹ IBM Research Europe, CH-8803, Rüschlikon, Switzerland. dpr@zurich.ibm.com.
² National Center for Competence in Research-Catalysis (NCCR-Catalysis), Rüschlikon, Switzerland. dpr@zurich.ibm.com.
³ IBM Research Europe, CH-8803, Rüschlikon, Switzerland.
⁴ National Center for Competence in Research-Catalysis (NCCR-Catalysis), Rüschlikon, Switzerland.

PMID: 35181654
PMCID: PMC8857209
DOI: 10.1038/s41467-022-28536-w

Daniel Probst et al. Nat Commun. 2022.

Abstract

Enzyme catalysts are an integral part of green chemistry strategies towards a more sustainable and resource-efficient chemical synthesis. However, the use of biocatalysed reactions in retrosynthetic planning clashes with the difficulties in predicting the enzymatic activity on unreported substrates and enzyme-specific stereo- and regioselectivity. As of now, only rule-based systems support retrosynthetic planning using biocatalysis, while initial data-driven approaches are limited to forward predictions. Here, we extend the data-driven forward reaction as well as retrosynthetic pathway prediction models based on the Molecular Transformer architecture to biocatalysis. The enzymatic knowledge is learned from an extensive data set of publicly available biochemical reactions with the aid of a new class token scheme based on the enzyme commission classification number, which captures catalysis patterns among different enzymes belonging to the same hierarchy. The forward reaction prediction model (top-1 accuracy of 49.6%), the retrosynthetic pathway (top-1 single-step round-trip accuracy of 39.6%) and the curated data set are made publicly available to facilitate the adoption of enzymatic catalysis in the design of greener chemistry processes.
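The class token scheme described above can be illustrated with a minimal sketch, assuming that an EC number such as `1.4.3.21` is truncated to a chosen hierarchy level (0 to 4) and then emitted as one token per level. The function names and the `[ec<depth>:<value>]` token spelling are hypothetical and chosen for illustration only; they are not taken from the published model or its tooling.

```python
from __future__ import annotations


def truncate_ec(ec_number: str, level: int) -> str:
    """Keep the first `level` fields of an EC number (level 0 drops it entirely)."""
    if not 0 <= level <= 4:
        raise ValueError("EC level must be between 0 and 4")
    fields = ec_number.split(".")
    return ".".join(fields[:level])


def ec_tokens(ec_number: str, level: int) -> list[str]:
    """Turn a truncated EC number into one token per hierarchy level.

    The "[ec<depth>:<value>]" spelling is a hypothetical convention.
    """
    truncated = truncate_ec(ec_number, level)
    if not truncated:
        return []
    return [
        f"[ec{depth}:{value}]"
        for depth, value in enumerate(truncated.split("."), start=1)
    ]


if __name__ == "__main__":
    for level in range(5):
        print(level, ec_tokens("1.4.3.21", level))
```

Because enzymes that share a prefix of the EC number also share the leading tokens, a sequence model can in principle reuse what it learns about one enzyme for others in the same branch of the hierarchy.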

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Introducing enzymes as green catalysts to data-driven template-free chemical synthesis.**
The molecular transformer was trained on chemical reactions extracted from the USPTO data set and the new ECREACT data set using multitask transfer learning.

**Fig. 2. Enzyme class, substrate and product distributions of the data set ECREACT.**
(a) The distribution of samples at EC-levels 1 (corresponding to enzyme classes) and 2 (corresponding to enzyme sub-classes) for oxidoreductases (class 1), transferases (class 2), hydrolases (class 3), lyases (class 4), isomerases (class 5), ligases (class 6) and translocases (class 7), in the ECREACT EC3 data set. A more extensive visualisation of the distribution of EC-levels 1, 2 and 3 can be found in Supplementary Fig. 3. TMAPs visualising the distribution of MAP4-encoded (b) reactants and (c) products in the ECREACT EC3 subset coloured by enzyme class corresponding to EC-level 1. Distributions of molecular distances (MAP4) per class are shown in Supplementary Fig. 1. While molecules of transferase- (class 2), lyase- (class 4), and, to a lesser extent, hydrolase-catalysed (class 3) reactions populate regions of the chemical space specific to each class (homogeneous), molecules from other classes are found in predominantly heterogeneous regions.

**Fig. 3. Overall accuracies of models based on different ECREACT token schemes EC0, EC1, EC2, EC3 and EC4.**
Accuracies are reported for (a) forward prediction, (b) backward prediction, (c) round-trip prediction (a forward prediction followed by a backward prediction) and (d) backward EC number only prediction. Top-n indicates the accuracy when checking the top n predictions for the correct one.
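The top-n metric in this caption can be sketched as follows, assuming each test reaction comes with a ranked list of predicted strings and a single ground-truth string. The function name and the exact string comparison are illustrative assumptions; in practice, predicted molecules would typically be canonicalised before being compared.

```python
from typing import Sequence


def top_n_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    n: int,
) -> float:
    """Fraction of samples whose target appears among the first n predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one prediction list per target")
    if not targets:
        return 0.0
    hits = sum(
        target in predictions[:n]
        for predictions, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Toy placeholder strings, not real model output.
    predictions = [["A", "B"], ["C", "D"], ["E", "F"]]
    targets = ["A", "D", "X"]
    print(top_n_accuracy(predictions, targets, n=1))
    print(top_n_accuracy(predictions, targets, n=2))
```

By construction, the value can only stay the same or increase as n grows, which is why top-5 figures are never lower than top-1 figures for the same model.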

**Fig. 4. Class-wise accuracy for the forward model trained on EC3.**
(a) The top-k prediction accuracy for each class shows significant differences among classes caused by the number of available samples per EC-level 3 category. (b) The accuracy of top-1 predictions per EC-level 3 category. Each dot represents an EC-level 3 subclass coloured by the number of test samples N. Large EC-level 3 subclasses (red) greatly influence the performance of predicting transferase-catalysed reaction (class 2) outcomes. Oxidoreductase-catalysed reactions (class 1) are distributed among many EC-level 3 subclasses, causing a lower performance compared to other classes with fewer samples overall. Detailed accuracies for top-2 and top-5 predictions can be found in Supplementary Fig. 4.

**Fig. 5. Inspection of forward predictions labelled as incorrect.**
For each reaction, the ground truth is shown in black while the prediction is shown in red. The reactions are catalysed by (1, 2) oxidoreductases acting on the CH-NH₂ group of donors with oxygen as acceptor, (3) a zeatin 9-aminocarboxyethyltransferase, (4) a cyclic-CMP phosphodiesterase, (5) a chloromuconate cycloisomerase and (6) a pantothenate synthetase.

**Fig. 6. Analysis of the attention weights in the forward prediction models on reaction (6) from Supplementary Fig. 8.**
The attention mapping between tokens representing EC numbers is highlighted in purple (reactant atom tokens are connected using grey curves). The curve thickness is proportional to the attention weight computed by the forward Molecular Transformer.

**Fig. 7. Class-wise accuracy for the backward model trained on EC3.**
(a) The top-k prediction accuracies for each class (corresponding to EC-level 1) show significant differences among classes caused by the number of available samples per EC-level 3 category. (b) The accuracy of top-1 predictions per EC-level 3 category. Each dot represents an EC-level 3 category coloured by the number of test samples N. Large EC-level 3 subclasses (red) greatly influence the performance of predicting transferase-catalysed reaction (class 2) outcomes. Oxidoreductase-catalysed reactions (class 1) are distributed among many EC-level 3 subclasses, causing a lower performance compared to other classes with fewer samples overall. Detailed accuracies for top-2 and top-5 predictions can be found in Supplementary Fig. 9.

**Fig. 8. Inspection of backward predictions labelled as incorrect.**
For each reaction, the ground truth is shown in black while the prediction is shown in red. The ground truth enzyme is marked with purple, the predicted enzyme with red. The model predicted (1, 4) different enzyme-catalysed reactions leading to the same product, (2) predicted a substrate with a different isomer, (3) corrected an erroneous data set entry and (5) was not able to predict an enzymatic reaction and fell back on a reaction learned from USPTO data.

**Fig. 9. Distribution of rxnfp fingerprints for the reactions in the combined space of ECREACT (grey) and RetroBioCat test set reactions (blue), embedded with TMAP.**
(a) The reactions from the RetroBioCat test set form distinct clusters in the combined reaction space. (b) For RetroBioCat test set (blue) reactions, the fraction of nearest neighbours (k = 10) that come from the set itself is consistently higher than for reactions from ECREACT (grey).
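The nearest-neighbour fraction in panel (b) can be sketched as below, assuming each reaction is represented by a numeric vector and labelled with the set it belongs to. This is a brute-force illustration using Euclidean distance, which is an assumption: the source only states that rxnfp fingerprints and k = 10 were used, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def same_set_neighbour_fraction(
    points: Sequence[Sequence[float]],
    labels: Sequence[str],
    k: int = 10,
) -> List[float]:
    """For each point, the fraction of its k nearest neighbours sharing its label."""
    fractions = []
    for i, point in enumerate(points):
        # Rank all other points by distance and keep the k closest.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        same = sum(labels[j] == labels[i] for j in neighbours)
        fractions.append(same / len(neighbours) if neighbours else 0.0)
    return fractions


if __name__ == "__main__":
    # Toy 2-D vectors standing in for reaction fingerprints.
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    labels = ["a", "a", "a", "b", "b"]
    print(same_set_neighbour_fraction(points, labels, k=2))
```

A fraction close to 1 for most members of a set indicates that the set occupies its own region of the space rather than being mixed in with the other set.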

**Fig. 10. Enzyme-catalysed synthesis of synthetically useful compounds under mild conditions.**
(1) Aminoalcohol, (2) Homoaspartate, (3) 4-hydroxy-L-glutamic acid, (4) β-ketoacid and (5) (S)-norlaudanosoline.
