Chem Sci. 2024 Jul 16;15(34):13618-13630.
doi: 10.1039/d4sc03647f. eCollection 2024 Aug 28.

Probing machine learning models based on high throughput experimentation data for the discovery of asymmetric hydrogenation catalysts

Adarsh V Kalikadien et al. Chem Sci.

Abstract

Enantioselective hydrogenation of olefins by Rh-based chiral catalysts has been extensively studied for more than 50 years. Naively, one would expect that everything about this transformation is known and that selecting a catalyst that induces the desired reactivity or selectivity is a trivial task. Nonetheless, ligand engineering or selection for any new prochiral olefin remains an empirical trial-and-error exercise. In this study, we investigated whether machine learning techniques could be used to accelerate the identification of the most efficient chiral ligand. For this purpose, we used high throughput experimentation to build a large dataset consisting of results for Rh-catalyzed asymmetric olefin hydrogenation, specifically designed for applications in machine learning. We showcased its alignment with existing literature while addressing observed discrepancies. Additionally, a computational framework for the automated and reproducible quantum-chemistry based featurization of catalyst structures was created. Together with less computationally demanding representations, these descriptors were fed into our machine learning pipeline for both out-of-domain and in-domain prediction tasks of selectivity and reactivity. For out-of-domain purposes, our models showed limited efficacy. It was found that even the most expensive descriptors do not impart significant meaning to the model predictions. The in-domain application, while partly successful for predictions of conversion, emphasizes the need for evaluating the cost-benefit ratio of computationally intensive descriptors and for tailored descriptor design. Challenges persist in predicting enantioselectivity, calling for caution in interpreting results from small datasets. Our insights underscore the importance of dataset diversity with broad substrate inclusion and suggest that mechanistic considerations could improve the accuracy of statistical models.

Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. Asymmetric hydrogenation reaction performed in this study. A set of varying substrates was selected to be tested with a wide range of Rh-based catalysts under varying conditions.
Fig. 2. The influence of various conditions on reactivity (conversion) and enantioselectivity (ee) in Rh-catalyzed asymmetric olefin hydrogenation. Solvent effect was evaluated on SM1–SM3 after 1 h reaction time. Pressure effect was evaluated on SM1 after 16 h. Temperature effect was evaluated on SM5 after 16 h and on one plate. An interactive version of this figure displaying the ligand structures corresponding to the data points can be found in the ESI (see interactive figure ‘Fig. 2.html’ in the ESI†).
Fig. 3. Consistency analysis: (A) raw distributions of conversion and |ee| in the current study (HTE data) – all data points in Table 1 – and in literature (reaxys + scifinder). (B) Scatter plot comparing the closest enantiomeric excess (|ee|) from literature with our experimental results under identical conditions (same catalyst, starting material, and solvent). (C) Venn diagram of ligand/substrate/solvent triplets divided into triplets with at least one consistent or discrepant reaxys record (green and red set, respectively). The arrow shows 8 triplets for which a consistency with scifinder was found. (D) Comparative analysis of |ee| discrepancies (|Δee| > 0.2) across our data (blue), reaxys (orange), and scifinder (red) for the 15 triplets for which no consistency with reaxys was found. An interactive version of this figure displaying the ligand structures corresponding to the data points can be found in the ESI (see interactive figure ‘Fig. 3.html’ in the ESI†).
Fig. 4. Distribution for conversion (%, on the top) and enantioselectivity (ΔΔG in kJ mol−1, on the right) in green, yellow, magenta, red and blue representing SM1, SM2, SM3, SM4, and SM5, respectively. The figure includes a Spearman correlation matrix of experimental values for substrate pairs, with the upper triangle showing ΔΔG and the lower triangle indicating conversion.
Fig. 5. PCA score plot (A) and cross-sections (C–E) based on binning descriptors into three categories: steric, geometric and electronic. Eight bisphosphine ligands are included as examples (B). The percentage of explained variance (EV) is reported in the axis labels.
Fig. 6. Schematic representation of the machine learning workflow. In both fully and partially out-of-domain modeling scenarios, for each target starting material (SM), the model is trained on data from at least two additional SMs in accordance with seven specific cases. The feature matrix is formed by concatenating descriptors of both catalyst and starting material. In partially out-of-domain modeling, half of the target SM samples are included in the training set. For in-domain tasks, each SM model undergoes training with an 80 : 20 training-test split, focusing solely on catalyst descriptors. We use random forest models for classification (reactivity) and regression (selectivity).
Fig. 7. Performance metrics for out-of-domain and in-domain modeling. Panels A and C display the balanced accuracy and R² score for out-of-domain modeling (A: conversion; C: ΔΔG), while Panels B and D show the same metrics for in-domain modeling (B: conversion; D: ΔΔG). In A and C the starting material is represented by one-hot encoding. Fully out-of-domain results for DFT-based descriptors are represented by red dots. Panel E shows the Gini feature importance for RF in-domain classifiers trained on DFT-based descriptors to model conversion.
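
The captions for Fig. 4 and Fig. 7 report enantioselectivity as ΔΔG (kJ mol−1) rather than as ee. The page does not state how the conversion was done, so the following is only a minimal sketch of the standard relation ΔΔG = RT ln[(1 + ee)/(1 − ee)], with ee as a signed fraction. The function names and the default temperature of 298.15 K are assumptions made for illustration, not values taken from the study.

```python
import math

# Gas constant in kJ mol^-1 K^-1, so results come out in kJ/mol.
R_KJ = 8.314462618e-3


def ee_to_ddg(ee: float, temperature_k: float = 298.15) -> float:
    """Convert a signed ee (fraction, -1 < ee < 1) to ddG in kJ/mol.

    The default temperature is an assumption for illustration only.
    """
    if not -1.0 < ee < 1.0:
        raise ValueError("ee must lie strictly between -1 and 1")
    # (1 + ee) / (1 - ee) is the enantiomeric ratio.
    return R_KJ * temperature_k * math.log((1.0 + ee) / (1.0 - ee))


def ddg_to_ee(ddg_kj: float, temperature_k: float = 298.15) -> float:
    """Inverse mapping: ddG in kJ/mol back to a signed ee fraction."""
    return math.tanh(ddg_kj / (2.0 * R_KJ * temperature_k))


if __name__ == "__main__":
    for ee in (0.0, 0.5, 0.9, 0.99):
        print(f"ee = {ee:.2f} -> ddG = {ee_to_ddg(ee):.2f} kJ/mol")
```

Because the mapping is logarithmic, equal steps in ee near 1 correspond to increasingly large steps in ΔΔG, which is one common reason to model selectivity on the energy scale instead of on the bounded ee scale.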
