Review
Chem Rev. 2021 Aug 25;121(16):10001-10036.
doi: 10.1021/acs.chemrev.0c01303. Epub 2021 Aug 13.

Ab Initio Machine Learning in Chemical Compound Space

Bing Huang et al. Chem Rev.

Abstract

Chemical compound space (CCS), the set of all theoretically conceivable combinations of chemical elements and (meta-)stable geometries that make up matter, is colossal. The first-principles based virtual sampling of this space, for example, in search of novel molecules or materials which exhibit desirable properties, is therefore prohibitive for all but the smallest subsets and simplest properties. We review studies aimed at tackling this challenge using modern machine learning techniques based on (i) synthetic data, typically generated using quantum mechanics based methods, and (ii) model architectures inspired by quantum mechanics. Such Quantum mechanics based Machine Learning (QML) approaches combine the numerical efficiency of statistical surrogate models with an ab initio view on matter. They rigorously reflect the underlying physics in order to reach universality and transferability across CCS. While state-of-the-art approximations to quantum problems impose severe computational bottlenecks, recent QML based developments indicate the possibility of substantial acceleration without sacrificing the predictive power of quantum mechanics.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
A cartoon of similarities among atoms across chemical compound space, not in conflict with quantum mechanics. The exemplary molecule aspirin is highlighted by bonds, and each of its atoms is superimposed with a similar atom in another molecule (hydrogens omitted for clarity). Green, yellow, gray, red, and blue refer to sulfur, phosphorus, carbon, oxygen, and nitrogen, respectively. Reproduced with permission from ref (15). Copyright 2020 Springer Nature.
Figure 2
3D projection of the high-dimensional kernel representation of chemical compound space. Within kernel ridge regression, chemical compound space corresponds to a complete graph in which every compound is represented by a black vertex and the black edges quantify similarities between compounds. Each compound, in turn, can be represented by a molecular complete graph (e.g., the Coulomb matrix (CM)) recording the elemental type of each atom and its distances to all other atoms. Given known training data for all compounds shown, a property prediction can be made for any query compound, as illustrated by X. The choice of kernel function, metric, and representation strongly affects the specific shape of this space and thereby the learning efficiency of the resulting QML model.
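The kernel ridge regression picture described for Figure 2 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the commonly used Coulomb matrix definition (0.5·Z_i^2.4 on the diagonal, Z_i·Z_j/|R_i - R_j| off the diagonal, distances in atomic units), a Gaussian kernel, and a purely synthetic toy property for a hypothetical diatomic. All function names, the kernel width `sigma`, and the regularizer `lam` are illustrative choices.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5*Z_i**2.4 on the diagonal, Z_i*Z_j/|R_i-R_j| elsewhere."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)           # placeholder to avoid division by zero
    cm = np.outer(z, z) / dist
    np.fill_diagonal(cm, 0.5 * z ** 2.4)  # overwrite the placeholder diagonal
    return cm

def gaussian_kernel(a, b, sigma):
    """Similarity between every row of a and every row of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(x_train, y_train, sigma, lam=1e-8):
    """Solve (K + lam*I) alpha = y for the regression weights alpha."""
    k = gaussian_kernel(x_train, x_train, sigma)
    return np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)

def krr_predict(x_query, x_train, alpha, sigma):
    """Prediction is a similarity-weighted sum over all training compounds."""
    return gaussian_kernel(x_query, x_train, sigma) @ alpha

# Toy data (assumption): a diatomic with Z = (1, 1) at several bond lengths,
# labeled with a synthetic Morse-like function instead of a quantum calculation.
bond_lengths = np.array([0.8, 1.0, 1.2, 1.4, 1.6])
x_train = np.array([coulomb_matrix([1, 1], [[0, 0, 0], [0, 0, d]]).ravel()
                    for d in bond_lengths])
y_train = (1.0 - np.exp(-(bond_lengths - 1.4))) ** 2

alpha = krr_fit(x_train, y_train, sigma=0.2)
x_query = coulomb_matrix([1, 1], [[0, 0, 0], [0, 0, 1.1]]).ravel()[None, :]
y_query = krr_predict(x_query, x_train, alpha, sigma=0.2)
print(y_query)
```

Flattening the full matrix is the simplest way to obtain a fixed-length vector for same-size molecules. It does not handle atom-index permutations or molecules of different size, which a practical representation would need to address.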
Figure 3
Illustration of learning curves: Errors (E) versus training set size (N). Horizontal and vertical thin lines illustrate exemplary target accuracy and available training set size, respectively. For functional ML models, training errors are close to zero (not shown), and prediction errors must decay linearly with N on log–log scales. Black-solid, dotted, dashed and dotted-dashed lines exemplify prediction errors of ML models with incomplete information (ceases to learn for large N due to being parametric, using nonunique representations, or training on noisy data), unique and less physical representation, unique and more physical representation, and explicit account of lowered effective dimensionality (i.e., “compact”), respectively. The solid-pink line corresponds to the training error for a parametric model. Training errors for ML models are negligible for noise-free data.
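The linear decay on log–log scales described for Figure 3 corresponds to a power law, E ≈ a·N^(-b), so a straight-line fit in log space gives the offset a and the learning rate b. The following is a minimal sketch under that assumption, using synthetic error values. The function names and the numbers are hypothetical and are not taken from the review.

```python
import numpy as np

def fit_learning_curve(n_train, errors):
    """Fit E = a * N**(-b) by least squares on log-log axes; return (a, b)."""
    slope, intercept = np.polyfit(np.log(n_train), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)

def training_size_for(target_error, a, b):
    """Invert E = a * N**(-b) to estimate the N needed for a target error."""
    return (a / target_error) ** (1.0 / b)

# Synthetic learning curve (assumption): exact power law with a = 10, b = 0.5.
n_train = np.array([100, 200, 400, 800, 1600], dtype=float)
errors = 10.0 * n_train ** -0.5

a, b = fit_learning_curve(n_train, errors)
n_needed = training_size_for(0.1, a, b)
print(a, b, n_needed)
```

Such an extrapolation is only meaningful while the model keeps learning. A curve that flattens at large N, as in the incomplete-information case in the caption, no longer follows a single power law.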
Figure 4
QML models infer properties for new chemical compositions. DFT and QML (FCHL+KRR) based predictions of covalent triple, double, and single bonding between groups IV and V (left column), VI (mid column), and VII (right column) elements, respectively. Open valencies in the group IV elements have been saturated with hydrogens. QML models were trained on the DFT results for all of those chemical elements that are not present in the query molecule. Reproduced with permission from ref (166). Copyright 2018 licensed under a Creative Commons Attribution (CC BY) license.
Figure 5
All AMONs sizes 1–7 for training system specific QML models of exemplary query molecule 2-(furan-2-yl)propan-2-ol (top right).
Figure 6
Property vs property matrix for ∼7k organic molecules at various levels of theory. A multiproperty neural net trained in CCS encodes underlying correlations as evinced by the first principal components of the last layer for 2k molecules not part of training. Reproduced with permission from ref (220). Copyright 2013 licensed under a Creative Commons Attribution 3.0 license.

References

    1. Rupp M. Special issue on machine learning and quantum mechanics. Int. J. Quantum Chem. 2015, 115, 1003–1004. doi: 10.1002/qua.24955.
    2. Rupp M.; von Lilienfeld O. A.; Burke K. Guest Editorial: Special Topic on Data-Enabled Theoretical Chemistry. J. Chem. Phys. 2018, 148, 241401. doi: 10.1063/1.5043213.
    3. Schneider W. F.; Guo H. Machine Learning. J. Phys. Chem. A 2018, 122, 879. doi: 10.1021/acs.jpca.8b00034.
    4. Prezhdo O. V. Advancing Physical Chemistry with Machine Learning. J. Phys. Chem. Lett. 2020, 11, 9656–9658. doi: 10.1021/acs.jpclett.0c03130.
    5. Tkatchenko A. Machine learning for chemical discovery. Nat. Commun. 2020, 11, 4125. doi: 10.1038/s41467-020-17844-8.