Sci Adv. 2025 Jun 13;11(24):eadt5578. doi: 10.1126/sciadv.adt5578. Epub 2025 Jun 13.

Negative chemical data boosts language models in reaction outcome prediction

Alessandra Toniato et al. Sci Adv.

Abstract

Trial-and-error approaches in chemistry generate abundant unsuccessful experiments, yet the potential of these so-called negative results remains largely underutilized. Here, we demonstrate that information from negative chemical reactions can be leveraged to improve reactivity-prediction models, offering advantages in scenarios with a limited volume of successful data. We extend the tuning of language models with reinforcement learning to the chemistry domain, training a transformer model for chemical reaction prediction. Our approach is evaluated using both a rigorously controlled dataset and a realistic high-throughput dataset comprising extensive reaction screenings across diverse catalyst sets and experimental conditions. The model achieves state-of-the-art performance by leveraging information from as few as 20 positive data points in the controlled dataset, supported by a negative dataset at least 40 times larger. Consistent results on both datasets demonstrate that, with an appropriate optimization strategy and the inclusion of unsuccessful experimental data, models can be effectively trained even when successful reactions are underrepresented.
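
As a rough illustration of how failed experiments could enter a reinforcement-learning signal, the sketch below assigns a scalar reward to a predicted product. The abstract does not specify the reward, so the function `reaction_reward`, the +1/-1/0 values, and the placeholder reaction labels are all assumptions made for illustration, not the authors' method. The helper `min_negative_count` only restates the abstract's arithmetic: 20 positive data points with a negative set at least 40 times larger implies at least 800 negative examples.

```python
from typing import Dict, Set


def reaction_reward(
    reactants: str,
    predicted: str,
    positives: Dict[str, str],
    negatives: Dict[str, Set[str]],
) -> float:
    """Hypothetical reward for one predicted product.

    +1.0 if the prediction matches the known successful product,
    -1.0 if it matches a product recorded as a failed outcome,
     0.0 if nothing is known about it.
    """
    if positives.get(reactants) == predicted:
        return 1.0
    if predicted in negatives.get(reactants, set()):
        return -1.0
    return 0.0


def min_negative_count(n_positive: int, ratio: int = 40) -> int:
    """Smallest negative-set size for a given positive count and size ratio."""
    return n_positive * ratio


# Placeholder labels (not real SMILES): one known positive and two known negatives.
positives = {"substrate+NBS": "product_A"}
negatives = {"substrate+NBS": {"product_B", "product_C"}}

rewards = [
    reaction_reward("substrate+NBS", p, positives, negatives)
    for p in ("product_A", "product_B", "product_X")
]
```

In this toy setup `rewards` holds one value per candidate prediction, so a policy-gradient update would be pushed toward the recorded positive and away from recorded negatives, while unknown outcomes contribute nothing.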

Figures

Fig. 1. Characterization of negative data.
Left: RegioSQM20 (38). Right: Data from (5). The molecular differences between the positive and the negative are highlighted in gray.
Fig. 2. RL and FT performance in low-data regimes.
Frequency of correctly predicted positive reactions by the RL and the FT models trained on hundreds of failed reactions and 20 positive reactions from the RegioSQM20 dataset (i.e., the Klow dataset subset). The shaded regions represent the SD across three random splits. The magenta dotted line represents the highest performance reached by FT on the Khigh dataset, where all the positive reactions from RegioSQM20 were used during training.
Fig. 3. Illustration of positive and negative embedding vectors for the bromination reaction of 5-(2-bromophenyl)isoxazole with N-bromosuccinimide.
In the case of the base model (top left), negative reaction outcomes are clustered tightly around the positive reaction outcome, whereas for the classification-tuned model (bottom left), negatives are pushed farther from the positive. Blue points are the remaining reactions from both RegioSQM and USPTO. A is the correct product of the bromination reaction. B, C, D, E, and F are negative products.
Fig. 4. Breakdown of RL performance.
Positive accuracy of RL trained on data splits obtained from five starting seeds for Khigh (left) and Klow (right).
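
The captions above report the frequency of correctly predicted positive reactions and its spread across random splits. A minimal sketch of that bookkeeping follows. It assumes a prediction counts as correct on exact string match with the recorded product and uses the sample standard deviation. The source states neither choice, and the names `positive_accuracy` and `summarize_splits` are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def positive_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of positive reactions whose predicted product equals the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one positive reaction is required")
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def summarize_splits(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample SD of per-split accuracies (needs at least two splits)."""
    return mean(accuracies), stdev(accuracies)


# Toy values for illustration only, not results from the paper.
acc = positive_accuracy(["A", "B", "A", "A"], ["A", "A", "A", "A"])
avg, sd = summarize_splits([0.5, 0.6, 0.7])
```

Swapping `stdev` for `statistics.pstdev` would give the population SD instead. With only three or five splits the two differ noticeably, so the choice should be stated alongside any plotted error band.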

References

    1. Maloney M. O., Coley C. W., Genheden S., Carson N., Helquist P., Norrby P.-O., Wiest O., Negative data in data sets for machine learning training. Org. Lett. 25, 2945–2947 (2023).
    2. Raccuglia P., Elbert K., Adler P. D. F., Falk C., Wenny M. B., Mollo A., Zeller M., Friedler S. A., Schrier J., Norquist A., Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
    3. Angello N. H., Rathore V., Beker W., Wołos A., Jira E. R., Roszak R., Wu T. C., Schroeder C. M., Aspuru-Guzik A., Grzybowski B. A., Burke M. D., Closed-loop optimization of general reaction conditions for heteroaryl Suzuki-Miyaura coupling. Science 378, 399–405 (2022).
    4. King-Smith E., Berritt S., Bernier L., Hou X., Klug-McLeod J. L., Mustakis J., Sach N. W., Tucker J. W., Yang Q., Howard R. M., Lee A. A., Probing the chemical ‘reactome’ with high-throughput experimentation data. Nat. Chem. 16, 633–643 (2024).
    5. Buitrago Santanilla A., Regalado E. L., Pereira T., Shevlin M., Bateman K., Campeau L. C., Schneeweis J., Berritt S., Shi Z., Nantermet P., Liu Y., Helmy R., Welch C. J., Vachal P., Davies J. W., Cernak T., Dreher S. D., Nanomole-scale high-throughput chemistry for the synthesis of complex molecules. Science 347, 49–53 (2015).
