Nat Commun. 2025 May 15;16(1):4522.
doi: 10.1038/s41467-025-59812-0.

Towards global reaction feasibility and robustness prediction with high throughput data and bayesian deep learning

Haowen Zhong et al. Nat Commun. 2025.

Abstract

Predicting the feasibility of organic reactions and their robustness against environmental factors is challenging. We address this issue by integrating high-throughput experimentation (HTE) and Bayesian deep learning. Whereas existing HTE studies focus on niche chemical spaces, in this work our in-house HTE platform conducted 11,669 distinct acid-amine coupling reactions in 156 working hours, yielding the most extensive single HTE dataset at a volumetric scale for industrial delivery. Our Bayesian neural network model achieved a benchmark prediction accuracy of 89.48% for reaction feasibility. Furthermore, our fine-grained uncertainty disentanglement enables efficient active learning, reducing data requirements by 80%. Additionally, our uncertainty analysis effectively identifies out-of-domain reactions and evaluates reaction robustness, or reproducibility, against environmental factors for scaling up, offering a practical framework for navigating chemical spaces and designing highly robust industrial processes.

Conflict of interest statement

Competing interests: H.Z., Y.L.L., H.S., Y.R.L., R.Z., B.L., Y.Y., S.L., P.W., and X.W. are employed by ChemLex, a company specializing in high-throughput synthesis. Y.H. serves as the Chief Executive Officer of MegaRobo, which provides high-throughput experimentation (HTE) equipment. The remaining authors (K.F., F.M., F.Y., and T.Y.) declare no competing interests.

Figures

Fig. 1. Overall workflow combining HTE and Bayesian deep learning to estimate reaction feasibility and robustness against environmental factors.
Wetlab data are collected using automated HTE, followed by probabilistic modeling using Bayesian neural networks. The uncertainty is disentangled into epistemic uncertainty and aleatoric uncertainty. Epistemic uncertainty originates from insufficient data and is used for further design of experiments (DoE). Aleatoric uncertainty is linked to the intrinsic noise of experimentation and is demonstrated to be an indicator of reaction robustness. We also found that extensive HTE exploration enhances the quality of uncertainty estimation.
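
The split into epistemic and aleatoric uncertainty described in this caption can be illustrated with a minimal sketch. It assumes the common entropy-based decomposition for a binary (feasible / not feasible) classifier, where several stochastic forward passes of a Bayesian model give several probabilities per reaction. The caption does not state which decomposition the authors use, so the function names and the input format here are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in nats of a Bernoulli distribution with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(sampled_probs):
    """Split predictive uncertainty for one reaction into two parts.

    sampled_probs: feasibility probabilities from repeated stochastic
    forward passes of a Bayesian model (hypothetical input format).
    Returns (predictive, aleatoric, epistemic), all in nats.
    """
    n = len(sampled_probs)
    mean_p = sum(sampled_probs) / n
    # Total uncertainty: entropy of the averaged prediction.
    predictive = binary_entropy(mean_p)
    # Aleatoric part: average entropy of the individual predictions.
    aleatoric = sum(binary_entropy(p) for p in sampled_probs) / n
    # Epistemic part: disagreement between samples (mutual information).
    epistemic = max(0.0, predictive - aleatoric)
    return predictive, aleatoric, epistemic


# Samples that agree on p = 0.5: the outcome is noisy, the model is not unsure.
print(decompose_uncertainty([0.5, 0.5, 0.5]))
# Samples that disagree strongly: the model itself is unsure.
print(decompose_uncertainty([0.01, 0.99]))
```

In this sketch, samples that all agree on an intermediate probability give high aleatoric and near-zero epistemic uncertainty, while samples that disagree give high epistemic uncertainty, which is the quantity the caption associates with further design of experiments.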
Fig. 2. Data analysis of the automated HTE data and substrate down-sampling process.
a Overview of acid-amine condensation reactions executed in this work. 1-Hydroxybenzotriazole (HOBt); Dimethylformamide (DMF); N,N-Diisopropylethylamine (DIEA); 1-Methylimidazole (NMI); O-(7-azabenzotriazol-1-yl)-N,N,N',N'-tetramethyluronium hexafluorophosphate (HATU); Chloro-N,N,N',N'-tetramethylformamidinium hexafluorophosphate (TCFH); benzotriazol-1-yloxytris(dimethylamino)phosphonium hexafluorophosphate (BOP); Bromotripyrrolidinophosphonium hexafluorophosphate (PyBrOP); benzotriazol-1-yloxytripyrrolidinophosphonium hexafluorophosphate (PyBOP); 1-(3-Dimethylaminopropyl)-3-ethylcarbodiimide hydrochloride (EDCI). b Categorical distributions in patent and commercially available datasets for carboxylic acids and amines. The carbon attached to the reaction center (carboxyl or primary amine group) is in a carbocyclic aromatic ring (SingleHomoAromatic), a carbocyclic aromatic ring system (PolyHomoAromatic), a single heteroaromatic ring (SingleHeteroAromatic), a heteroaromatic ring system (PolyHeteroAromatic), a single aliphatic ring (SingleAliphaticRing), an aliphatic ring system (PolyAliphaticRing), or an aliphatic chain (AliphaticChain). The additional 2 categories for amines represent whether a secondary amine is cyclic. c, d Analysis after down-sampling includes t-SNE visualizations and kernel density estimation (KDE) plots of Böttcher complexity for acids and amines. In the t-SNE plots, selected acids (red) or amines (blue) are displayed alongside a random subset of 2000 compounds extracted from patent data (gray). The KDE plots illustrate the probability density functions of the Böttcher molecular complexity for three groups: selected compounds (red), patent compounds (yellow), and purchasable compounds (blue).
Fig. 3. Wetlab experiments and results.
a The HTE equipment used for our wetlab experiments (Left: side view; Right: top view). b Proportion of accessible chemical space for HTE and literature data. NiCOlit; AZ ELN: AstraZeneca’s Electronic Lab Notebooks; Suzuki HTE; BH HTE: Buchwald-Hartwig HTE. The yield distributions are shown for the HTE dataset in this work and other widely used literature/HTE datasets, along with SciFinder query and Pistachio. c t-SNE visualization of products in our HTE data (in purple) and patent data (in gray). d Surprising reactions that give fairly good yields despite known factors that impede the reaction (upper: steric hindrance; lower: partial charge on the nitrogen atom in the amine group). Yields: 62.12% (upper), 73.97% (lower).
Fig. 4. Data split and model performance.
a t-SNE visualization with KDE contours of different data split strategies. In the random split, the training and test sets are closely fused, indicating minimal domain shift and similar structures in both sets. Domain shift increases with substrate novelty in stratified splits (“one substrate unseen” and “both substrates unseen”), presenting more challenging learning tasks for modeling methods. b The receiver operating characteristic (ROC) curve for the random split in our HTE dataset. A larger area under the ROC curve (AUC) indicates better model performance. c The calibration curve for the random split in our HTE dataset. The closer the calibration curve is to the diagonal line, the better the model’s predicted probabilities reflect the actual outcomes. d The wrong-prediction curve for the random split in our HTE dataset. A larger area under this curve indicates a better linkage between uncertainty and the identification of challenging samples. e–g The active learning performance with different sampling methods under different data split strategies. We start with 100 data entries and incrementally add 100 until reaching 2000, selecting samples based on predictive (predictive_unc_based), aleatoric (aleatoric_unc_based), or epistemic uncertainty (epistemic_unc_based), or randomly (random_selection). e Model performance under random split. f Model performance under stratified split (one substrate unseen). g Model performance under stratified split (both substrates unseen).
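
The active learning schedule in panels e–g (start with 100 entries, add 100 at a time up to 2000) can be sketched as a simple loop. This is only an illustration of the schedule: `score_fn` is a hypothetical callback standing in for "retrain the model and score each unlabeled reaction by the chosen uncertainty", and picking the highest-scoring entries is an assumed selection rule, not a detail given in the caption.

```python
import random


def select_batch(pool_scores, batch_size):
    """Return positions of the batch_size highest-scoring pool entries."""
    ranked = sorted(range(len(pool_scores)),
                    key=lambda i: pool_scores[i], reverse=True)
    return ranked[:batch_size]


def active_learning_schedule(n_pool, score_fn, start=100, step=100,
                             stop=2000, seed=0):
    """Grow a labeled set from `start` to `stop` entries in steps of `step`.

    score_fn(labeled, unlabeled) is a hypothetical callback that returns one
    uncertainty score per unlabeled entry (higher means more informative).
    Returns the labeled indices and the labeled-set size after each round.
    """
    rng = random.Random(seed)
    unlabeled = list(range(n_pool))
    rng.shuffle(unlabeled)
    # The initial labeled set is drawn at random.
    labeled, unlabeled = unlabeled[:start], unlabeled[start:]
    sizes = [len(labeled)]
    while len(labeled) < stop and unlabeled:
        scores = score_fn(labeled, unlabeled)
        picked = select_batch(scores, min(step, stop - len(labeled)))
        picked_set = set(picked)
        labeled += [unlabeled[i] for i in picked]
        unlabeled = [x for i, x in enumerate(unlabeled)
                     if i not in picked_set]
        sizes.append(len(labeled))
    return labeled, sizes


# Placeholder scorer: random scores, which mimics the random_selection baseline.
def random_scores(labeled, unlabeled, _rng=random.Random(1)):
    return [_rng.random() for _ in unlabeled]


labeled, sizes = active_learning_schedule(5000, random_scores)
print(sizes[0], sizes[-1], len(sizes))
```

Swapping `random_scores` for a scorer based on predictive, aleatoric, or epistemic uncertainty would give the other three sampling strategies named in the caption.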
Fig. 5. Repetition experiment results for aleatoric uncertainty.
a The box plot of the yields among the three repetition experiment subsets. Dashed lines denote the 20% yield threshold between positive and negative reactions. b The ICC of three subsets grouped by aleatoric uncertainty. c Cases with high aleatoric uncertainty from the HTE dataset. d KDE of the PDF differences between reactions from the discovery phase and the process phase. e KDE-estimated PDF after down-sampling the data to the same volume as in (d). Blue bars and line: discovery phase; red bars and line: process phase. f Similar reactions from a practical pharmaceutical case that differ in aleatoric uncertainty (Bortezomib (Johnson & Johnson): approved for Waldenström’s macroglobulinemia, acute lymphoblastic leukemia, mantle cell lymphoma, and multiple myeloma).
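
Panel b reports an ICC for repeated experiments. Assuming ICC denotes the intraclass correlation coefficient, and assuming the one-way random-effects form ICC(1,1) (the caption does not say which variant the authors use), a minimal sketch of the calculation on repeated yield measurements looks like this. The function name and the example yields are hypothetical.

```python
def icc_one_way(yields):
    """One-way random-effects ICC(1,1) for repeated yield measurements.

    yields: list of rows, one row per reaction, each row holding the same
    number k >= 2 of repeated yield measurements (hypothetical layout).
    Values near 1 mean repeats of a reaction agree closely compared with
    the spread between reactions.
    """
    n = len(yields)
    k = len(yields[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in yields):
        raise ValueError("need >= 2 reactions with the same k >= 2 repeats")
    row_means = [sum(row) / k for row in yields]
    grand_mean = sum(row_means) / n
    # Mean square between reactions.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Mean square within reactions (repeat-to-repeat scatter).
    msw = sum((x - m) ** 2
              for row, m in zip(yields, row_means)
              for x in row) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("all measurements are identical; ICC is undefined")
    return (msb - msw) / denom


# Illustrative, made-up yields (%): three reactions, two repeats each.
reproducible = [[10.0, 12.0], [50.0, 49.0], [90.0, 88.0]]
erratic = [[10.0, 90.0], [85.0, 15.0], [50.0, 45.0]]
print(icc_one_way(reproducible), icc_one_way(erratic))
```

Under these assumptions, a subset of reactions with low aleatoric uncertainty would be expected to show a higher ICC than a subset with high aleatoric uncertainty, which is the comparison the panel describes.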
