Review
J Chem Inf Model. 2023 Jul 24;63(14):4253-4265.
doi: 10.1021/acs.jcim.3c00607. Epub 2023 Jul 5.

Data Sharing in Chemistry: Lessons Learned and a Case for Mandating Structured Reaction Data

Rocío Mercado et al. J Chem Inf Model.

Abstract

The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how we can take lessons from these case studies and apply them to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.


Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
(top) Timeline of key dates surrounding the databases discussed in this work. (bottom) The growth of each database over time, excluding the ORD. Count is the exact number of entries according to each database (sources: CSD, PDB, PubChem, and ChEMBL). Traces do not necessarily start close to 0 due to limited public information for early dates of some efforts.
Figure 2
Timeline illustrating the growth in contributors over time for each database. Sources for the data are given in Figure 1. PubChem data on the individual number of contributors over time was not available; thus “sources” (i.e., organizations) are plotted instead. As ChEMBL does not follow a contributor model but an expert curation model, its growth in data sources is plotted instead. Finally, note that, while the CSD follows a contributor model, the submission process also includes manual review by domain experts at the CCDC. Traces do not necessarily start close to 0 due to limited public information for early dates of some efforts.
Figure 3
Four methods for obtaining structured reaction information: (top left) mining historical unstructured data, (top right) manually structuring and translating present/historical data via electronic lab notebooks, (bottom left) publishing existing structured data centrally and publicly, and (bottom right) going forward, building in best practices from the start, whether running benchtop or high-throughput experiments. Regardless of the approach, the ORD can provide a framework for depositing, validating, and distributing structured reaction data. Icons downloaded from flaticon.com.

