Probing machine learning models based on high throughput experimentation data for the discovery of asymmetric hydrogenation catalysts

Adarsh V Kalikadien¹, Cecile Valsecchi², Robbert van Putten³, Tor Maes³, Mikko Muuronen³, Natalia Dyubankova³, Laurent Lefort³, Evgeny A Pidko¹

Affiliations

¹ Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands. e.a.pidko@tudelft.nl
² Discovery, Product Development and Supply, Janssen Cilag S.p.A., Viale Fulvio Testi 280/6, 20126 Milano, Italy.
³ Discovery, Product Development and Supply, Janssen Pharmaceutica N.V., Turnhoutseweg 30, 2340 Beerse, Belgium. llefort@its.jnj.com

PMID: 39211503
PMCID: PMC11352728
DOI: 10.1039/d4sc03647f

Adarsh V Kalikadien et al. Chem Sci. 2024 Jul 16;15(34):13618-13630. doi: 10.1039/d4sc03647f. eCollection 2024 Aug 28.

Abstract

Enantioselective hydrogenation of olefins by Rh-based chiral catalysts has been extensively studied for more than 50 years. Naively, one would expect that everything about this transformation is known and that selecting a catalyst that induces the desired reactivity or selectivity is a trivial task. Nonetheless, ligand engineering or selection for any new prochiral olefin remains an empirical trial-and-error exercise. In this study, we investigated whether machine learning techniques could be used to accelerate the identification of the most efficient chiral ligand. For this purpose, we used high throughput experimentation to build a large dataset of results for Rh-catalyzed asymmetric olefin hydrogenation, specifically designed for applications in machine learning. We showed that the dataset is consistent with existing literature and examined the discrepancies observed. Additionally, a computational framework for the automated and reproducible quantum-chemistry-based featurization of catalyst structures was created. Together with less computationally demanding representations, these descriptors were fed into our machine learning pipeline for both out-of-domain and in-domain prediction tasks of selectivity and reactivity. For out-of-domain predictions, our models showed limited efficacy: even the most expensive descriptors did not impart significant meaning to the model predictions. The in-domain application, while partly successful for predictions of conversion, emphasizes the need for evaluating the cost-benefit ratio of computationally intensive descriptors and for tailored descriptor design. Challenges persist in predicting enantioselectivity, calling for caution in interpreting results from small datasets. Our insights underscore the importance of dataset diversity with broad substrate inclusion and suggest that mechanistic considerations could improve the accuracy of statistical models.

This journal is © The Royal Society of Chemistry.

Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. Asymmetric hydrogenation reaction performed in this study. A set of varying substrates was selected to be tested with a wide range of Rh-based catalysts under varying conditions.

Fig. 2. The influence of various conditions on reactivity (conversion) and enantioselectivity (ee) in Rh-catalyzed asymmetric olefin hydrogenation. Solvent effect was evaluated on SM1–SM3 after 1 h reaction time. Pressure effect was evaluated on SM1 after 16 h. Temperature effect was evaluated on SM5 after 16 h and on one plate. An interactive version of this figure displaying the ligand structures corresponding to the data points can be found in the ESI (see interactive figure ‘Fig. 2.html’ in the ESI†).

Fig. 3. Consistency analysis: (A) raw distributions of conversion and |ee| in the current study (HTE data; all data points in Table 1) and in literature (Reaxys + SciFinder). (B) Scatter plot comparing the closest enantiomeric excess (|ee|) from literature with our experimental results under identical conditions (same catalyst, starting material, and solvent). (C) Venn diagram of ligand/substrate/solvent triplets divided into triplets with at least one consistent or discrepant Reaxys record (green and red set, respectively). The arrow shows 8 triplets for which a consistency with SciFinder was found. (D) Comparative analysis of |ee| discrepancies (|Δee| > 0.2) across our data (blue), Reaxys (orange), and SciFinder (red) for the 15 triplets for which no consistency with Reaxys was found. An interactive version of this figure displaying the ligand structures corresponding to the data points can be found in the ESI (see interactive figure ‘Fig. 3.html’ in the ESI†).

Fig. 4. Distribution for conversion (%, on the top) and enantioselectivity (ΔΔG^‡ in kJ mol⁻¹, on the right) in green, yellow, magenta, red and blue representing SM1, SM2, SM3, SM4, and SM5, respectively. The figure includes a Spearman correlation matrix of experimental values for substrate pairs, with the upper triangle showing ΔΔG^‡ and the lower triangle indicating conversion.
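Fig. 4 reports enantioselectivity as ΔΔG^‡ rather than as ee. The caption does not state how the conversion was done; the following is a minimal sketch, assuming the standard relation ΔΔG^‡ = RT ln[(1 + ee)/(1 − ee)] with ee given as a signed fraction. The function name `ee_to_ddg`, the default temperature of 298.15 K, and the clipping limit `max_abs_ee` are illustrative assumptions and are not taken from the paper.

```python
import math

R = 8.314462618  # gas constant in J mol^-1 K^-1


def ee_to_ddg(ee, temperature=298.15, max_abs_ee=0.99):
    """Convert a signed ee (fraction between -1 and 1) to ddG in kJ/mol.

    Assumes ddG = R * T * ln((1 + ee) / (1 - ee)).
    |ee| is clipped to max_abs_ee (an illustrative choice) so that
    ee = +1 or -1 gives a finite value instead of a division by zero.
    """
    clipped = max(-max_abs_ee, min(max_abs_ee, ee))
    ratio = (1.0 + clipped) / (1.0 - clipped)
    return R * temperature * math.log(ratio) / 1000.0
```

Under these assumptions, an ee of 0.9 at 298.15 K corresponds to roughly 7.3 kJ mol⁻¹, and a racemic result (ee = 0) maps to zero.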

Fig. 5. PCA score plot (A) and cross-sections (C–E) based on binning descriptors into three categories: steric, geometric and electronic. Eight bisphosphine ligands are included as examples (B). Percent of explained variance (EV) is reported in the axis labels.

Fig. 6. Schematic representation of the machine learning workflow. In both fully and partially out-of-domain modeling scenarios, for each target starting material (SM), the model is trained on data from at least two additional SMs in accordance with seven specific cases. The feature matrix is formed by concatenating descriptors of both catalyst and starting material. In partially out-of-domain modeling, half of the target SM samples are included in the training set. For in-domain tasks, each SM model undergoes training with an 80 : 20 training-test split, focusing solely on catalyst descriptors. We use random forest for classification (reactivity) and regression (selectivity).
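The three data-splitting scenarios described in the Fig. 6 caption can be sketched as follows. This is a hedged illustration only: the function name `split_for_target`, the record layout with an `"sm"` key, and the mode names are assumptions, and the sketch trains on all other SMs rather than reproducing the seven specific training cases, which the caption does not enumerate.

```python
import random


def split_for_target(records, target_sm, mode="full", seed=0):
    """Split records into (train, test) for one target starting material.

    records: list of dicts, each with at least an "sm" key (e.g. "SM1").
    mode "full":    no target-SM rows in training (fully out-of-domain).
    mode "partial": half of the target-SM rows are moved into training.
    mode "in":      80:20 split within the target SM only (in-domain).
    """
    rng = random.Random(seed)
    target = [r for r in records if r["sm"] == target_sm]
    others = [r for r in records if r["sm"] != target_sm]
    rng.shuffle(target)  # shuffle so the split does not depend on input order

    if mode == "full":
        return others, target
    if mode == "partial":
        half = len(target) // 2
        return others + target[:half], target[half:]
    if mode == "in":
        cut = int(round(0.8 * len(target)))
        return target[:cut], target[cut:]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical usage: 5 substrates x 10 ligands, SM3 as the target.
records = [{"sm": f"SM{i}", "ligand": j} for i in range(1, 6) for j in range(10)]
train, test = split_for_target(records, "SM3", mode="partial")
```

Keeping the split logic separate from the model makes it easy to check that no target-SM rows leak into training in the fully out-of-domain case.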

Fig. 7. Performance metrics for out-of-domain and in-domain modeling. Panels A and C display the balanced accuracy and R² score for out-of-domain modeling (A: Conversion; C: ΔΔG^‡), while Panels B and D illustrate the same for in-domain modeling (B: Conversion; D: ΔΔG^‡). In A and C the starting material's representation is one-hot encoded. Fully out-of-domain results for DFT-based descriptors are represented by red dots. E: Gini feature importance for RF in-domain classifiers trained on DFT-based descriptors to model conversion.
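The two metrics named in the Fig. 7 caption can be written out in a few lines. The sketch below uses the textbook definitions (balanced accuracy as the mean per-class recall, R² as one minus the ratio of residual to total sum of squares); it is not the authors' evaluation code, and the function names are illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Raises ZeroDivisionError if all y_true values are identical.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Balanced accuracy is the natural choice when conversion classes are imbalanced: a classifier that always predicts the majority class on labels `[0, 0, 0, 1]` is right 75% of the time but scores only 0.5 here. Likewise, an R² of 0 means the regressor does no better than predicting the mean.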

Cited by

Advances in Gasoline Hydrodesulfurization Catalysts: The Role of Structure-Activity Relationships and Machine Learning Approaches.
Sun H, Chen C, Zhang R, Li Y, Ge S, Cui P. ACS Omega. 2025 Jul 18;10(29):31262-31273. doi: 10.1021/acsomega.5c02980. eCollection 2025 Jul 29. PMID: 40757365. Free PMC article. Review.

Data-Driven Virtual Screening of Conformational Ensembles of Transition-Metal Complexes.
Finta S, Kalikadien AV, Pidko EA. J Chem Theory Comput. 2025 May 27;21(10):5334-5345. doi: 10.1021/acs.jctc.5c00303. Epub 2025 May 9. PMID: 40340435. Free PMC article.
