2015 Jun 18;6(12):2326-31.
doi: 10.1021/acs.jpclett.5b00831.

Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space

Katja Hansen et al. J Phys Chem Lett.

Abstract

Simultaneously accurate and efficient prediction of molecular properties throughout chemical compound space is a critical ingredient toward rational compound design in chemical and pharmaceutical industries. Aiming toward this goal, we develop and apply a systematic hierarchy of efficient empirical methods to estimate atomization and total energies of molecules. These methods range from a simple sum over atoms, to addition of bond energies, to pairwise interatomic force fields, reaching to the more sophisticated machine learning approaches that are capable of describing collective interactions between many atoms or bonds. In the case of equilibrium molecular geometries, even simple pairwise force fields demonstrate prediction accuracy comparable to benchmark energies calculated using density functional theory with hybrid exchange-correlation functionals; however, accounting for the collective many-body interactions proves to be essential for approaching the “holy grail” of chemical accuracy of 1 kcal/mol for both equilibrium and out-of-equilibrium geometries. This remarkable accuracy is achieved by a vectorized representation of molecules (so-called Bag of Bonds model) that exhibits strong nonlocality in chemical space. In addition, the same representation allows us to predict accurate electronic properties of molecules, such as their polarizability and molecular frontier orbital energies.

Figures

Figure 1
Schematic overview of employed modeling approaches: The dressed atoms model incorporates only the atoms and weights them according to their type. The sum-over-bonds and 2-body potentials consider pairs of atoms and the interactions between them. The Bag of Bonds model, which is illustrated for the ethanol molecule (C2H5OH), implements a collective energy expression based on all interatomic distances within a molecule.
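The simplest level of this hierarchy, the dressed atoms model, can be illustrated as a linear least-squares fit with one energy weight per element. The following is a minimal sketch, not the authors' implementation: the element set, the function names, and the numeric weights are all hypothetical and chosen only to make the example self-checking.

```python
import numpy as np

ELEMENTS = ("H", "C", "N", "O")  # assumed element set for this toy example


def composition_vector(atoms):
    """Count how many atoms of each element a molecule contains."""
    return np.array([atoms.count(el) for el in ELEMENTS], dtype=float)


def fit_dressed_atoms(molecules, energies):
    """Least-squares fit of one energy weight per element type."""
    X = np.stack([composition_vector(m) for m in molecules])
    y = np.asarray(energies, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights


def predict_dressed_atoms(weights, atoms):
    """Energy estimate as a weighted sum over atoms."""
    return float(composition_vector(atoms) @ weights)


# Synthetic check: energies are generated from made-up weights (not real data),
# so the fit should recover exactly those weights.
true_weights = np.array([-60.0, -160.0, -100.0, -90.0])  # hypothetical values
molecules = [
    ["C"] + ["H"] * 4,               # CH4
    ["O"] + ["H"] * 2,               # H2O
    ["N"] + ["H"] * 3,               # NH3
    ["C", "O"] + ["H"] * 2,          # CH2O
    ["C", "C", "O"] + ["H"] * 6,     # C2H5OH
]
energies = [composition_vector(m) @ true_weights for m in molecules]
fitted_weights = fit_dressed_atoms(molecules, energies)
```

Because such a model sees only the composition, it assigns the same energy to all isomers of a given formula, which is why the higher levels of the hierarchy add pairwise and collective terms.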
Figure 2
Polynomial potentials for C–C interaction: The normalized gray histogram refers to the distribution of C–C distances within the GDB-7 data set and is associated with the right-hand axis. The red dots represent the energies of the C–C single, double, and triple bonds, as given by fits to experimental bond energies. In blue, polynomial two-body potentials (as trained in cross validation) are shown. The inset shows the difference between potentials for distances between 2.2 and 2.8 Å.
Figure 3
Schematic view of the Bag of Bonds (BoB) representation. (a) 3D structure of ethanol (CH3CH2OH) and (b) involved nuclear charges for each Coulomb matrix element. (c) Different Coulomb matrix entries that are present for ethanol are sorted into bags, and the BoB vector (d) is obtained by concatenating these bags and adding zeros to allow for dealing with other molecules with larger bags.
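The construction in panels (b) to (d) can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: it uses only the off-diagonal Coulomb matrix entries Z_i Z_j / |R_i - R_j|, sorts each bag in descending order, and takes the padded bag lengths from a caller-supplied `bag_sizes` dictionary. The function name, the dictionary, and the water geometry are hypothetical, and the coordinates are in arbitrary units.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def bag_of_bonds(charges, coords, bag_sizes):
    """Build a Bag of Bonds vector.

    charges   : nuclear charges, one per atom
    coords    : Cartesian coordinates, shape (n_atoms, 3)
    bag_sizes : maps a sorted charge pair, e.g. (1, 8), to its padded length
    """
    coords = np.asarray(coords, dtype=float)
    bags = defaultdict(list)
    for i, j in combinations(range(len(charges)), 2):
        key = tuple(sorted((charges[i], charges[j])))
        distance = np.linalg.norm(coords[i] - coords[j])
        bags[key].append(charges[i] * charges[j] / distance)

    unknown = set(bags) - set(bag_sizes)
    if unknown:
        raise ValueError(f"no bag size given for pairs {sorted(unknown)}")

    vector = []
    for key in sorted(bag_sizes):  # fixed bag order shared by all molecules
        entries = sorted(bags.get(key, []), reverse=True)
        if len(entries) > bag_sizes[key]:
            raise ValueError(f"bag {key} is too small for this molecule")
        # Zero padding lets molecules with fewer pairs share one vector length.
        vector.extend(entries + [0.0] * (bag_sizes[key] - len(entries)))
    return np.array(vector)


# Toy water-like geometry (made-up coordinates, for illustration only).
water_charges = [8, 1, 1]
water_coords = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
sizes = {(1, 1): 3, (1, 8): 4, (8, 8): 1}
bob_water = bag_of_bonds(water_charges, water_coords, sizes)
```

Sorting within each bag makes the vector independent of the order in which atoms are listed, so two atom orderings of the same geometry map to the same representation.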
Figure 4
Estimated atomization energy of the ethanol molecule (C2H5OH) as predicted by the BoB model using Gaussian (blue line) and Laplacian (red line) kernels. The PBE0 reference energy is indicated by the dashed green line. For a given training set size, the estimation is an average of predictions from 10 optimized models, each employing independently sampled training molecules (excluding ethanol) from the GDB-7 database. The envelope encloses the standard deviation of the estimate from 10 independent runs.
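The two kernels compared in this figure differ only in the distance they apply to the representation vectors. The sketch below shows a plain kernel ridge regression with a Gaussian kernel (squared Euclidean distance) and a Laplacian kernel (Manhattan distance). It is a generic textbook formulation, not the authors' code; the function names, the kernel width `sigma`, the regularization `lam`, and the random stand-in data are assumptions for illustration.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """k(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def laplacian_kernel(A, B, sigma):
    """k(x, x') = exp(-||x - x'||_1 / sigma)."""
    l1_dist = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-l1_dist / sigma)


def krr_fit(K_train, y, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients."""
    return np.linalg.solve(K_train + lam * np.eye(len(y)), y)


def krr_predict(K_query_train, alpha):
    """Prediction is a kernel-weighted sum over the training molecules."""
    return K_query_train @ alpha


# Random vectors stand in for BoB representations; targets are synthetic.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
y_train = X_train.sum(axis=1)

fits = {}
for name, kernel in (("gaussian", gaussian_kernel), ("laplacian", laplacian_kernel)):
    K = kernel(X_train, X_train, 1.0)
    alpha = krr_fit(K, y_train, 1e-8)
    fits[name] = krr_predict(K, alpha)
```

In practice `sigma` and `lam` would be selected by cross-validation, and the averaging over 10 independently sampled training sets described in the caption would wrap this fit in an outer loop.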
Figure 5
Mean absolute error (MAE in kcal/mol) for BoB and polynomial models: Training sets from N = 500 to 7000 data points were sampled identically for the different methods. The polynomial models of degree 10 and 18 exhibit high variances due to the random stratification, which for small N leads to nonrobust fits.
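Sampling the training sets identically for every method amounts to drawing the indices once per training set size and reusing them for each model. A minimal sketch of that bookkeeping and of the MAE metric follows; the helper names and the seed are hypothetical, and the sampling here is plain uniform sampling, which may differ from the stratification used in the paper.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def shared_training_indices(n_total, sizes, seed=0):
    """Draw one index set per training set size, to be reused by every model."""
    rng = np.random.default_rng(seed)
    return {n: rng.choice(n_total, size=n, replace=False) for n in sizes}


# The same index sets would be passed to each model being compared.
index_sets = shared_training_indices(n_total=7000, sizes=(500, 1000, 7000))
example_mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 1.0])
```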
Figure 6
Error distributions of BoB-predicted electronic properties, namely polarizability (α), atomization energy (E), and HOMO and LUMO eigenvalues (ϵ), for 2165 randomly drawn out-of-sample molecules from GDB-7, for training set sizes of N = 1000 and N = 5000.
