Negative chemical data boosts language models in reaction outcome prediction
- PMID: 40512839
- PMCID: PMC12164950
- DOI: 10.1126/sciadv.adt5578
Abstract
Trial-and-error approaches in chemistry generate abundant unsuccessful experiments, yet the potential of these so-called negative results remains largely underutilized. Here, we demonstrate that information from negative chemical reactions can be leveraged to improve reactivity-prediction models, offering advantages in scenarios with a limited volume of successful data. We extend the tuning of language models with reinforcement learning to the chemistry domain, training a transformer model for chemical reaction prediction. Our approach is evaluated using both a rigorously controlled dataset and a realistic high-throughput dataset comprising extensive reaction screenings across diverse catalyst sets and experimental conditions. The model achieves state-of-the-art performance by leveraging information from as few as 20 positive data points in the controlled dataset, supported by a negative dataset at least 40 times larger. Consistent results on both datasets demonstrate that, with an appropriate optimization strategy and the inclusion of unsuccessful experimental data, models can be effectively trained even when successful reactions are underrepresented.