J Chem Theory Comput. 2023 Sep 26;19(18):6185-6196.
doi: 10.1021/acs.jctc.3c00491. Epub 2023 Sep 13.

Treating Semiempirical Hamiltonians as Flexible Machine Learning Models Yields Accurate and Interpretable Results

Frank Hu et al. J Chem Theory Comput.

Abstract

Quantum chemistry provides chemists with invaluable information, but the high computational cost limits the size and type of systems that can be studied. Machine learning (ML) has emerged as a means to dramatically lower the cost while maintaining high accuracy. However, ML models often sacrifice interpretability by using components such as the artificial neural networks of deep learning that function as black boxes. These components impart the flexibility needed to learn from large volumes of data but make it difficult to gain insight into the physical or chemical basis for the predictions. Here, we demonstrate that semiempirical quantum chemical (SEQC) models can learn from large volumes of data without sacrificing interpretability. The SEQC model is that of density-functional-based tight binding (DFTB) with fixed atomic orbital energies and interactions that are one-dimensional functions of the interatomic distance. This model is trained to ab initio data in a manner that is analogous to that used to train deep learning models. Using benchmarks that reflect the accuracy of the training data, we show that the resulting model maintains a physically reasonable functional form while achieving an accuracy, relative to coupled cluster energies with a complete basis set extrapolation (CCSD(T)*/CBS), that is comparable to that of density functional theory (DFT). This suggests that trained SEQC models can achieve a low computational cost and high accuracy without sacrificing interpretability. Use of a physically motivated model form also substantially reduces the amount of ab initio data needed to train the model compared to that required for deep learning models.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Comparison of different quantum chemistry methods on atomization energies (eq 2 and Section 4). The heatmap is generated from the ~230k molecular configurations in the ANI-1CCX data set with up to eight heavy atoms, after removing configurations with incomplete entries. The DFTBML DFT/CC parametrizations were trained to ωB97x/def2-TZVPP or CCSD(T)*/CBS energies, respectively, on 20 000 molecules with up to eight heavy atoms. Agreement of DFTBML with the method to which it was trained is highlighted in white boxes. DFTBML improves substantially on currently published DFTB parameters (MIO and Auorg), with the agreement between DFTBML CC and CCSD(T)*/CBS being somewhat better than that between DFT (ωB97x/def2-TZVPP) and CCSD(T)*/CBS.
Figure 2
Distributions of internuclear distances between H–H, C–C, C–H, and N–O in the cleaned ANI-1CCX data set for molecules with up to eight heavy atoms. Repulsive interactions are truncated beyond nearest-neighbor interactions (blue arrows) with a lower bound of 0 Å. Electronic interactions go to a longer range (4.5 Å) with a lower bound slightly lower than the shortest distance in a given distribution. Precise cutoffs for electronic and repulsive splines can be found in Tables S2 and S3, respectively.
Figure 3
Overview of the method used to generate data sets. Red arrows indicate random sampling. Molecules are divided based on their empirical formulas, ensuring no mixing between training and testing data.
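The formula-based split described in this caption can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `split_by_formula`, the `(formula, payload)` record layout, and the `test_fraction` parameter are all assumptions introduced here. The point it shows is that whole empirical formulas, rather than individual configurations, are randomly assigned to one side of the split, so no formula appears in both sets.

```python
import random
from collections import defaultdict


def split_by_formula(molecules, test_fraction=0.2, seed=0):
    """Split (formula, payload) records so that no empirical formula
    appears in both the training and the testing set.

    Hypothetical helper for illustration; the record layout and the
    test_fraction parameter are assumptions, not taken from the paper.
    """
    # Group every configuration under its empirical formula.
    groups = defaultdict(list)
    for formula, payload in molecules:
        groups[formula].append((formula, payload))

    # Randomly sample formulas (not individual configurations).
    formulas = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(formulas)
    n_test = max(1, round(test_fraction * len(formulas)))
    test_formulas = set(formulas[:n_test])

    train = [m for f in formulas if f not in test_formulas for m in groups[f]]
    test = [m for f in formulas if f in test_formulas for m in groups[f]]
    return train, test


# Toy records: the integer payload stands in for geometry and target data.
records = [("C2H6", 0), ("C2H6", 1), ("CH4", 2), ("H2O", 3), ("NH3", 4), ("NH3", 5)]
train_set, test_set = split_by_formula(records, test_fraction=0.25, seed=1)
```

Shuffling the formula keys rather than the records is what prevents two configurations of the same molecule from landing on opposite sides of the split.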
Figure 4
Effects of regularization on (C2p|N2p)σ overlaps (S, top row) and Hamiltonian elements (H1, bottom row): no regularization (left column), convex penalty that constrains the sign of the second derivative (middle column), and convex plus smoothing that penalizes the magnitude of the third derivative (right column). The Auorg reference functions (orange, dashed lines) are included for comparison to the functions trained on the Transfer CC 2500 data set (blue).
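The two penalties named in this caption can be illustrated with finite differences on a sampled one-dimensional function. This is a sketch under stated assumptions, not the DFTBML implementation: the exact functional form of the penalties is not given in this text, so squared violations and squared third differences are assumed here, and the function names and the `sign` parameter are hypothetical.

```python
import math


def finite_differences(values, order):
    """Apply the forward-difference operator `order` times."""
    out = list(values)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def derivative_penalties(r, f, sign=1.0):
    """Return (convex_penalty, smooth_penalty) for samples f on a uniform grid r.

    convex_penalty: sum of squared second-derivative values whose sign
        disagrees with `sign` (+1 requires f'' >= 0, -1 requires f'' <= 0).
    smooth_penalty: sum of squared third-derivative values.
    Both forms are assumptions made for illustration.
    """
    h = r[1] - r[0]  # uniform grid spacing is assumed
    d2 = [d / h**2 for d in finite_differences(f, 2)]
    d3 = [d / h**3 for d in finite_differences(f, 3)]
    convex_penalty = sum(max(0.0, -sign * d) ** 2 for d in d2)
    smooth_penalty = sum(d**2 for d in d3)
    return convex_penalty, smooth_penalty


# A decaying exponential is convex everywhere, so it incurs no convex penalty.
grid = [1.0 + 0.1 * i for i in range(31)]
smooth_curve = [math.exp(-x) for x in grid]
# Adding a small oscillation violates convexity and raises the smoothing penalty.
wiggly_curve = [y + 0.05 * math.sin(20.0 * x) for x, y in zip(grid, smooth_curve)]

print(derivative_penalties(grid, smooth_curve))
print(derivative_penalties(grid, wiggly_curve))
```

In a training loop, terms like these would be added to the loss with their own weights, so that the fitted functions keep a physically reasonable shape while still following the data.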
Figure 5
Trade-off between MAE in total energy and dipoles as a function of the dipole weighting factor for DFTBML CC 2500. A weighting factor of 100 (eÅ)⁻¹ was chosen to improve the performance on dipoles while only marginally impacting the performance on total energy. More details on hyperparameter sensitivities can be found in Section S12.
Figure 6
Final training, validation, and testing losses for each of the physical targets as a function of the size of the data set used for training. Results are for training to the CC energy target. Error bars are shown as formula image where σ is the standard deviation of the errors calculated separately for the training, validation, and testing values.
Figure 7
Example splines for Hamiltonian elements (H1, left) and overlap elements (S, right) generated from DFTBML training on the CC targets of disjoint data sets, each containing 5000 molecules. The Auorg potentials are included for reference.
Figure 8
High-level overview of the DFTBML model workflow. Note that model testing uses DFTB+ and is external to model training (lower right).
Figure 9
Schematic illustration of inverting the SCF (orange arrows) and training (blue arrows) loops of the DFTBML workflow. In the outer loop, the charge fluctuations needed for the Fock operator are updated based on the current model parameters. The repulsive model is updated on the same schedule as for the charge fluctuations.
