Sci Data. 2022 Dec 24;9(1):779.
doi: 10.1038/s41597-022-01870-w.

Transition1x - a dataset for building generalizable reactive machine learning potentials

Mathias Schreiner et al. Sci Data.

Abstract

Machine Learning (ML) models have, in contrast to their usefulness in molecular dynamics studies, had limited success as surrogate potentials for reaction barrier search. This is primarily because available datasets for training ML models on small molecular systems almost exclusively contain configurations at or near equilibrium. In this work, we present the dataset Transition1x containing 9.6 million Density Functional Theory (DFT) calculations of forces and energies of molecular configurations on and around reaction pathways at the ωB97x/6-31G(d) level of theory. The data was generated by running Nudged Elastic Band (NEB) with DFT on 10k organic reactions of various types while saving intermediate calculations. We train equivariant graph message-passing neural network models on Transition1x and cross-validate on the popular ANI1x and QM9 datasets. We show that ML models cannot learn features in transition state regions solely by training on hitherto popular benchmark datasets. Transition1x is a new, challenging benchmark that will provide an important step towards developing next-generation ML force fields that also work far from equilibrium configurations and in reactive systems.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Overview of the data generation workflow. First, the reactant and product are relaxed before generating an initial minimum energy path (MEP) guess with the image dependent pair potential (IDPP). Next, NEB and climbing-image NEB (CINEB) are run on the initial path until convergence. If the MEP does not converge within 500 iterations, we discard the reaction, as unphysical configurations may have been encountered. If the reaction converges, all intermediate paths are saved in the dataset, as long as they are sufficiently different from previously saved paths.
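The control flow in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `step` and `converged` callables stand in for one NEB/CINEB update and its convergence check, and `path_difference` with its `min_difference` threshold is a hypothetical stand-in for the "sufficiently different" criterion, which the caption does not specify.

```python
from typing import Callable, List, Optional

# Hypothetical layout: a path is a list of images, each a flat coordinate list.
Path = List[List[float]]

MAX_ITERATIONS = 500  # from the caption: discard if not converged by then


def path_difference(a: Path, b: Path) -> float:
    """Largest absolute coordinate change between two paths (assumed metric)."""
    return max(
        abs(x - y)
        for img_a, img_b in zip(a, b)
        for x, y in zip(img_a, img_b)
    )


def collect_paths(
    initial: Path,
    step: Callable[[Path], Path],
    converged: Callable[[Path], bool],
    min_difference: float,
) -> Optional[List[Path]]:
    """Return the saved paths, or None if the reaction is discarded."""
    saved = [initial]
    path = initial
    for _ in range(MAX_ITERATIONS):
        if converged(path):
            return saved
        path = step(path)
        # Keep a path only if it differs enough from the last saved one.
        if path_difference(path, saved[-1]) >= min_difference:
            saved.append(path)
    # Out of iterations: keep the data only if the last update converged.
    return saved if converged(path) else None
```

Returning `None` for a non-converged reaction mirrors the caption's rule that nothing from a discarded reaction enters the dataset.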
Fig. 2
Structure of Transition1x HDF5 file. Parent groups are data/train/val/test. The data group contains all configurations in the set, and the train/val/test groups contain symlinks to the suggested data splits used in this paper. Each split has a set of chemical formulas unique to that split, and each formula contains all reactions with the given atoms. Finally, energy and force calculations can be accessed from the reaction groups for all intermediate configurations, including transition state, product, and reactant.
Fig. 3
Structure of QM9x HDF5 file. Energy and force calculations for all configurations in the QM9x dataset consisting of a certain combination of atoms can be accessed as datasets through the formula group.
Fig. 4
Plot of NEB convergence on an example reaction. Panel (a) displays the final MEP, with reactant, transition state, and product plotted on top; H, C, and O are shown in white, black, and red, respectively. The x-axis is the reaction coordinate: the distance along the reaction path in configurational space, measured in Å. The y-axis is the difference in potential energy between the reactant and the current configuration. Panel (b) displays the convergence of NEB. The x-axis is the NEB iteration; the y-axis is the force in eV Å⁻¹ and the energy barrier in eV at the current step. Fmax, shown in red, is the maximal perpendicular force acting on any geometry along the path, and Barrier, shown in blue, is the height of the energy barrier found at the current step. Moving right in the plot, Fmax converges towards zero as NEB finds the saddle point, and Barrier converges towards the final value of 3.6 eV seen in panel (a).
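The two quantities tracked in panel (b) can be computed along these lines. This is a sketch under assumptions: each image's force and path tangent are treated as single flattened vectors, and the function names are illustrative rather than taken from any NEB library.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def barrier(energies: Sequence[float]) -> float:
    """Highest energy on the path minus the reactant (first image) energy."""
    return max(energies) - energies[0]


def perpendicular_force(force: Vector, tangent: Vector) -> List[float]:
    """Remove the component of `force` that lies along the path tangent."""
    norm = math.sqrt(sum(t * t for t in tangent))
    unit = [t / norm for t in tangent]
    parallel = sum(f * u for f, u in zip(force, unit))
    return [f - parallel * u for f, u in zip(force, unit)]


def fmax(forces: Sequence[Vector], tangents: Sequence[Vector]) -> float:
    """Largest perpendicular-force magnitude over all images on the path."""
    return max(
        math.sqrt(sum(c * c for c in perpendicular_force(f, t)))
        for f, t in zip(forces, tangents)
    )
```

At a converged path the perpendicular forces vanish, which is why Fmax approaching zero signals that the saddle point has been located.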
Fig. 5
Comparison of transition states and barriers found in this work with NEB and the 6-31G(d) basis set, and in the original work with GSM and the def2-mSVP basis set. Panel (a) displays energies in eV for all transition states, calculated using NEB on the x-axis and GSM on the y-axis. Panel (b) displays a histogram of the Root Mean Square Deviation (RMSD) between the transition states found by the two methods.
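For reference, an RMSD between two geometries with the same atom ordering might be computed as below. This sketch only removes the translational offset by centering both structures; whether the published comparison also applies a rotational alignment is not stated in the caption, so that step is left out here as an explicit simplification.

```python
import math
from typing import List, Sequence

Coords = Sequence[Sequence[float]]  # one (x, y, z) triple per atom


def centered(coords: Coords) -> List[List[float]]:
    """Shift a geometry so that its centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    return [[p[i] - centroid[i] for i in range(3)] for p in coords]


def rmsd(a: Coords, b: Coords) -> float:
    """RMSD after centering; no rotational alignment (simplifying assumption)."""
    if len(a) != len(b):
        raise ValueError("geometries must have the same number of atoms")
    ca, cb = centered(a), centered(b)
    squared = sum(
        (x - y) ** 2
        for p, q in zip(ca, cb)
        for x, y in zip(p, q)
    )
    return math.sqrt(squared / len(a))
```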
Fig. 6
Distribution of forces acting on each atom type in each dataset. The x-axis is the force measured in eV/Å. The y-axis is the base-10 logarithm of the count of forces in each bin, normalized over the full domain so that all sets can be compared. Transition1x is shown in blue, ANI1x in yellow, and QM9x in green.
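One way to read the y-axis described here is sketched below: counts are binned, divided by the total count so datasets of different size become comparable, and then passed through a base-10 logarithm. The bin width, the number of bins, and the interpretation of "normalized over the full domain" as a fraction of the total are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence


def log_normalized_histogram(
    values: Sequence[float], bin_width: float, n_bins: int
) -> List[Optional[float]]:
    """log10 of the fraction of values per bin; None marks an empty bin."""
    counts = [0] * n_bins
    for v in values:
        idx = int(v // bin_width)
        if 0 <= idx < n_bins:  # values outside the binned range are ignored
            counts[idx] += 1
    total = sum(counts)
    return [math.log10(c / total) if c else None for c in counts]
```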
Fig. 7
Distribution of interatomic distances between heavy atoms in each dataset. A configuration with n heavy atoms contributes n(n−1)/2 distances to the count. The y-axis is the log frequency of the interatomic distance, normalized between 0 and 1 for comparison, as the datasets vary in size. The x-axis is the distance in units of r0, where r0 is the equilibrium length of a single bond in the smallest possible stable molecule that can be made with the atoms in question. Transition1x is shown in blue, ANI1x in yellow, and QM9x in green; QM9x was recalculated using our level of theory.
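The pair count and the r0 scaling described in this caption can be illustrated as follows. The `R0` table holds rough placeholder bond lengths chosen for this example only; they are not the reference values used in the paper, and pairs not listed would need to be added.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# Placeholder single-bond lengths in Å, for illustration only (assumed values).
R0: Dict[Tuple[str, str], float] = {
    ("C", "C"): 1.54,
    ("C", "N"): 1.47,
    ("C", "O"): 1.43,
}


def scaled_heavy_atom_distances(
    symbols: Sequence[str], positions: Sequence[Sequence[float]]
) -> List[float]:
    """Distances between all heavy-atom pairs, in units of the pair's r0."""
    heavy = [(s, p) for s, p in zip(symbols, positions) if s != "H"]
    scaled = []
    # n heavy atoms yield n(n-1)/2 unordered pairs.
    for (sa, pa), (sb, pb) in combinations(heavy, 2):
        key = tuple(sorted((sa, sb)))
        scaled.append(math.dist(pa, pb) / R0[key])
    return scaled
```

A value near 1 on this scale corresponds to a typical single bond, so stretched or breaking bonds along a reaction path show up at larger scaled distances.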
