Predicting success in Cu-catalyzed C-N coupling reactions using data science

Mohammad H Samha¹, Lucas J Karas¹, David B Vogt¹, Emmanuel C Odogwu¹, Jennifer Elward², Jennifer M Crawford³, Janelle E Steves³, Matthew S Sigman¹

Affiliations

¹ Department of Chemistry, University of Utah, 315 S. 1400 E., Salt Lake City, UT 84112, USA.
² Molecular Design, GlaxoSmithKline, 1250 S. Collegeville Rd., Collegeville, PA 19426, USA.
³ Drug Substance Development, GlaxoSmithKline, 1250 S. Collegeville Rd., Collegeville, PA 19426, USA.

PMID: 38232169
PMCID: PMC10793951
DOI: 10.1126/sciadv.adn3478

Mohammad H Samha et al. Sci Adv. 2024 Jan 19;10(3):eadn3478. doi: 10.1126/sciadv.adn3478. Epub 2024 Jan 17.

Abstract

Data science is assuming a pivotal role in guiding reaction optimization and streamlining experimental workloads in the evolving landscape of synthetic chemistry. A discipline-wide goal is the development of workflows that integrate computational chemistry and data science tools with high-throughput experimentation, because this integration gives experimentalists the ability to maximize success in expensive synthetic campaigns. Here, we report an end-to-end data-driven process to effectively predict how structural features of coupling partners and ligands affect Cu-catalyzed C-N coupling reactions. The established workflow underscores the limitations posed by substrates and ligands while also providing a systematic ligand prediction tool that uses probability to assess when a ligand will be successful. This platform is strategically designed to confront the intrinsic unpredictability frequently encountered in synthetic reaction deployment.

Figures

**Fig. 1. General modeling workflow, application of C–N couplings in drug development, and reaction conditions for this work.**
(A) General workflow for modeling chemical reactions: Dataset design is initiated with a database query to create a library of commercially available substrates and ligands. These molecules are then parameterized using quantum-chemical calculations and clustered by similarity using dimensionality reduction and unsupervised ML techniques. Subsequently, molecules from each cluster are selected to form a diverse and representative substrate space for training the ML model. The training sets can be refined using active learning strategies that use classification models to identify substrate and ligand features responsible for the activity. These insights, in turn, guide the selection of additional substrates and/or ligands. The reaction output resulting from the final combination of substrates and ligands is then used to train a predictive ML model. (B) Synthetic applications of C–N cross-couplings in drug development. Black and blue atoms and bonds represent moieties originally from the aryl bromide and primary amine, respectively. Gray atoms and bonds indicate chemical transformations made after the C–N cross-coupling. (C) Reaction conditions used in this work and illustration of the tool’s utility, including confidence values on the prediction and top-suggested ligands for testing.

**Fig. 2. Iterative refinement of the training set.**
(A) Initial training set yields and yield distribution revealing a notable bias towards off reactivity (yield, <20%). (B) Analysis of the initial training set with the Cu–L computed interaction distance (d). The analysis achieves an accuracy of 0.96 and an F1 score of 0.95. The prediction accuracy for the newly selected ligands is 100%. The dotted lines indicate the 20% yield threshold for on:off reactivity and the 2.07-Å threshold for the computed d, signifying the importance of d in reaction yields. Red and blue dots correspond to ligands that either fall below or exceed the 20% yield threshold in the ligand training set, respectively. Newly chosen ligands are marked with gray crosses. Selected ligands are depicted with dark contours. (C) Representation of selected products. Newly selected ligand yields and yield distribution demonstrating a balanced on:off ratio.
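
The single-descriptor analysis in panel (B) can be pictured as a one-threshold classifier scored by accuracy and F1. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the sample data, and the direction of the rule (whether a distance below 2.07 Å maps to "on") are all hypothetical, since the caption gives only the two thresholds.

```python
from typing import List, Sequence, Tuple


def classify_by_threshold(
    distances: Sequence[float],
    cutoff: float = 2.07,      # computed Cu-L distance threshold (Å) from the caption
    on_if_below: bool = True,  # ASSUMPTION: direction of the rule is not stated in the caption
) -> List[bool]:
    """Label each ligand 'on' (True) or 'off' (False) from one descriptor."""
    return [(d < cutoff) == on_if_below for d in distances]


def accuracy_and_f1(
    predicted: Sequence[bool], observed: Sequence[bool]
) -> Tuple[float, float]:
    """Compute accuracy and F1 for the 'on' class from paired labels."""
    tp = sum(p and o for p, o in zip(predicted, observed))
    fp = sum(p and not o for p, o in zip(predicted, observed))
    fn = sum(o and not p for p, o in zip(predicted, observed))
    tn = sum(not p and not o for p, o in zip(predicted, observed))
    accuracy = (tp + tn) / len(observed)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical (distance in Å, yield in %) pairs, invented for illustration only.
data = [(1.98, 65.0), (2.01, 48.0), (2.05, 12.0),
        (2.10, 5.0), (2.15, 0.0), (2.20, 31.0)]

observed = [y >= 20.0 for _, y in data]  # 20% yield on:off threshold from the caption
predicted = classify_by_threshold([d for d, _ in data])
acc, f1 = accuracy_and_f1(predicted, observed)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Exposing `on_if_below` as a parameter keeps the sketch usable whichever side of the 2.07-Å line corresponds to productive ligands.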

**Fig. 3. Decision tree classification model and external validation process.**
(A) Decision tree classification model and its accuracy in classifying ligand-substrate combinations for Ullmann C–N couplings into either on or off reactivity; graphic representation of the molecular features present at the decision nodes; and proposed catalytic cycle highlighting key mechanistic steps of Ullmann C–N couplings. (B) External validation process of the decision tree classification model using substrates that were not seen during the training phase. The classification results are indicated by the colors blue (representing on-classification) and red (representing off-classification). The training products are depicted as circles, while the validation products are denoted by crosses. Selected validation products are highlighted with dark contours.

**Fig. 4. Estimation of confidence in the prediction value and confidence map.**
(A) Example of prediction confidence estimation using product **P123**: Out of the 18 ligands tested, 16 ligands yielded results as predicted (<20%), resulting in an estimated prediction confidence of 85%. (B) Prediction confidence map constructed using estimated confidence values for each product within this study (encompassing both training and validation sets) and the radial basis function interpolation method. Areas shaded in darker blue indicate higher confidence in predicting “on” reactivity, whereas darker red denotes greater confidence in predicting “off” reactivity. The targeted product **P123** is highlighted with a white contour, while its two nearest neighbors are highlighted with black contours. The recommended ligands are marked (*) and **L28** appears in both neighbors (**).
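
As a rough illustration of the two ideas in this caption, the sketch below estimates a prediction confidence as the fraction of tested ligands whose observed on/off outcome matches the prediction, and then suggests ligands from the two nearest neighbors of a target product. It is a simplified sketch under assumptions: the function names, the 2D coordinates, the product and ligand labels, and all yields are hypothetical, the authors' exact estimator may differ, and the radial basis function interpolation used for the map is not reproduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def prediction_confidence(
    predicted_on: bool, observed_yields: Sequence[float], threshold: float = 20.0
) -> float:
    """Fraction of tested ligands whose observed on/off outcome matches the prediction."""
    matches = sum((y >= threshold) == predicted_on for y in observed_yields)
    return matches / len(observed_yields)


def recommend_ligands(
    target_xy: Tuple[float, float],
    products: Dict[str, dict],
    k: int = 2,
    threshold: float = 20.0,
) -> Tuple[List[str], List[str], Dict[str, int]]:
    """Rank ligands that were 'on' for the k nearest products in the map space."""
    nearest = sorted(
        products, key=lambda name: math.dist(target_xy, products[name]["xy"])
    )[:k]
    counts: Dict[str, int] = {}
    for name in nearest:
        for ligand, y in products[name]["yields"].items():
            if y >= threshold:
                counts[ligand] = counts.get(ligand, 0) + 1
    # Ligands that worked for more neighbors come first; ties break alphabetically.
    ranked = sorted(counts, key=lambda lig: (-counts[lig], lig))
    return nearest, ranked, counts


# Hypothetical yields (%) for ten ligands on a product predicted to be "off".
confidence = prediction_confidence(False, [0, 5, 12, 3, 25, 8, 1, 40, 15, 2])

# Hypothetical products with made-up map coordinates and per-ligand yields.
products = {
    "PA": {"xy": (0.0, 0.0), "yields": {"LA": 55.0, "LB": 10.0, "LC": 30.0}},
    "PB": {"xy": (1.0, 0.0), "yields": {"LA": 40.0, "LB": 25.0, "LC": 5.0}},
    "PC": {"xy": (5.0, 5.0), "yields": {"LA": 0.0, "LB": 0.0, "LC": 90.0}},
}
nearest, ranked, counts = recommend_ligands((0.4, 0.1), products)
print(confidence, nearest, ranked)
```

In this toy data, a ligand counted twice plays the role of a ligand that appears in both neighbors, which is how the caption's double-asterisk marking can be read.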

**Fig. 5. Synthetic applications.**
(A) Prediction of a favorable and (B) challenging combination of substrates toward synthetically relevant products through Ullmann C–N couplings. The target compounds are indicated by white outlines, while the two nearest neighbors are highlighted with bold black outlines. The recommended ligands are marked (*).

Grants and funding

R35 GM136271/GM/NIGMS NIH HHS/United States
