Nat Commun. 2020 Oct 30;11(1):5505.
doi: 10.1038/s41467-020-19267-x.

Machine learning in chemical reaction space

Sina Stocker et al. Nat Commun. 2020.

Abstract

Chemical compound space refers to the vast set of all possible chemical compounds, estimated to contain 10^60 molecules. While intractable as a whole, modern machine learning (ML) is increasingly capable of accurately predicting molecular properties in important subsets. Here, we therefore engage in the ML-driven study of even larger reaction space. Central to chemistry as a science of transformations, this space contains all possible chemical reactions. As an important basis for 'reactive' ML, we establish a first-principles database (Rad-6) containing closed and open-shell organic molecules, along with an associated database of chemical reaction energies (Rad-6-RE). We show that the special topology of reaction spaces, with central hub molecules involved in multiple reactions, requires a modification of existing compound space ML-concepts. Showcased by the application to methane combustion, we demonstrate that the learned reaction energies offer a non-empirical route to rationally extract reduced reaction networks for detailed microkinetic analyses.
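The link between molecular energies, reaction energies, and hub molecules can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that a reaction energy is the difference between the summed atomization energies of products and reactants, and that the atomization energy is defined as E(molecule) minus the energies of the isolated atoms (so bound molecules are negative). All numbers are hypothetical placeholders rather than Rad-6 values, and the names `reaction_energy` and `node_degrees` are illustrative.

```python
from collections import defaultdict

# Hypothetical placeholder atomization energies in eV (NOT taken from Rad-6).
# Assumed convention: AE = E(molecule) - sum of E(isolated atoms).
atomization_energy = {"CH4": -18.0, "CH3": -13.5, "CH2": -8.5, "H": 0.0, "H2": -4.5}

# Each reaction is a pair (reactants, products); these are bond dissociations.
reactions = [
    (("CH4",), ("CH3", "H")),
    (("CH3",), ("CH2", "H")),
    (("H2",), ("H", "H")),
]


def reaction_energy(reactants, products, ae):
    """Reaction energy as sum of product AEs minus sum of reactant AEs.

    For a balanced reaction the atomic reference energies cancel.
    """
    return sum(ae[m] for m in products) - sum(ae[m] for m in reactants)


def node_degrees(reactions):
    """Count, for every molecule, the number of reactions it takes part in."""
    degree = defaultdict(int)
    for reactants, products in reactions:
        for molecule in set(reactants) | set(products):
            degree[molecule] += 1
    return dict(degree)


if __name__ == "__main__":
    for reactants, products in reactions:
        energy = reaction_energy(reactants, products, atomization_energy)
        print(" + ".join(reactants), "->", " + ".join(products), energy)
    print(node_degrees(reactions))
```

In this toy network the hydrogen atom takes part in all three reactions, so an error in its predicted energy would enter every reaction energy, which is one way to picture why hub molecules matter more in reaction space than in compound space.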

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Visualization of chemical reaction spaces as graphs with molecules as nodes and reactions as edges.
a Full network of bond dissociation reactions for carbon-, oxygen-, hydrogen-containing molecules with up to four heavy atoms. b Reduced reaction network of the initial steps of natural gas combustion. Nodes are colored according to the number of incident edges/reactions (their degree) from low (white) to high (dark green).
Fig. 2. The Rad-6 database.
a Number of molecules in the database, according to their number of non-hydrogen atoms. b Structures of representative molecules in the database. Dots indicate radicals and respective SMILES strings are listed.
Fig. 3. Visualizing Rad-6 with Kernel Principal Component Analysis (kPCA).
a kPCA based on an intensive kernel. b kPCA based on an extensive kernel. Points are colored according to the DFT atomization energy per atom in (a) and total atomization energy in (b). The arrows provide a qualitative interpretation of the principal component (PC) axes, and small black dots indicate the FPS-selected training configurations for an ML model with 1000 training molecules, using the corresponding distance criterion (D_int in (a), D_ext in (b)), see text.
Fig. 4. Learning curves for atomization energies (AE).
a, b Mean absolute error (MAE) of AE predictions on the test set as a function of the number of training molecules n_train. The training sets were constructed using FPS with the extensive (a) and intensive (b) kernels (see text). c AE learning curves using molecular geometries obtained with the universal forcefield (UFF). The gray line represents a learning rate of n_train^0.65 and serves as a guide to the eye in all three panels.
Fig. 5. Illustration of the Rad-6 chemical space as an interpolated height profile.
a kPCA as in Fig. 3 showing the DFT reference intensive atomization energies AE/N (in eV). b Prediction from the ML model using an intensive kernel and a small intensively selected training set of only 1000 molecules with UFF geometries. c Respective differences (DFT-ML). Here, the range of the colorbar is shifted but the scale is the same.
Fig. 6. Correlation of mean absolute errors (MAE) for AE and RE prediction.
a Correlation plot for the extensive FPS training set using the extensive and intensive kernels and DFT geometries. b Correlation plot for the intensive FPS training set using the extensive and intensive kernels and DFT geometries. Multiple points for each model represent the different training set sizes shown in Fig. 4 (indicated in (b)), with smaller AE errors corresponding to larger training sets.
Fig. 7. ML-based exploration of a complex reaction network.
Each frame shows the reduced reaction network extracted from a microkinetic simulation of methane combustion at different stages in simulation time. The abstract simulation time is shown for each frame in arbitrary units, see text. Educts and products (in bold), as well as important intermediates are highlighted. Nodes are colored according to their absolute atomization energies from low (red) to high (blue). Cyclic compounds are marked with an asterisk, to distinguish them from the corresponding linear compounds.
Fig. 8. Comparison of ML models trained on the Rad-6 and the Rad-6-BS databases.
a Learning curves for AE predictions using the extensive kernel with an extensive FPS split and DFT geometries. b Same as (a) but for the intensive kernel with an intensive FPS split. c Correlation plot of MAE RE vs MAE AE for both Rad-6 and Rad-6-BS. Blue lines represent results obtained with the extensive kernel (crosses for Rad-6 and stars for Rad-6-BS) in (a) and (c). Red circles correspond to the intensive kernel with Rad-6 and orange diamonds to the intensive kernel with Rad-6-RE.
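
The FPS-based training-set selection referred to in the captions of Figs. 3, 4, 6 and 8 can be sketched as follows, assuming that FPS denotes farthest point sampling and that the distance criterion is the standard kernel-induced distance D(i, j) = sqrt(K_ii + K_jj - 2 K_ij). Both are assumptions, since the captions do not define them, and the linear toy kernel below merely stands in for the intensive or extensive kernels of the paper. The function names are illustrative.

```python
import math


def kernel_distance(K, i, j):
    """Kernel-induced distance sqrt(K_ii + K_jj - 2 K_ij) (assumed form)."""
    squared = K[i][i] + K[j][j] - 2.0 * K[i][j]
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


def farthest_point_sampling(K, n_select, start=0):
    """Greedily pick points that are farthest from everything selected so far."""
    n = len(K)
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [kernel_distance(K, i, start) for i in range(n)]
    while len(selected) < min(n_select, n):
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], kernel_distance(K, i, nxt))
    return selected


# Toy example: a linear kernel on hypothetical one-dimensional features,
# for which the induced distance reduces to |x - y|.
features = [0.0, 0.1, 0.2, 5.0, 10.0]
K = [[x * y for y in features] for x in features]

if __name__ == "__main__":
    print(farthest_point_sampling(K, 3))
```

Starting from the first point, the sketch selects the most distant point next and then the one farthest from both, so a small training set spreads over the whole space instead of clustering in densely populated regions.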
