J Chem Theory Comput. 2023 Sep 26;19(18):6185-6196.

doi: 10.1021/acs.jctc.3c00491. Epub 2023 Sep 13.

Treating Semiempirical Hamiltonians as Flexible Machine Learning Models Yields Accurate and Interpretable Results

Frank Hu¹, Francis He¹, David J Yaron¹

PMID: 37705220
PMCID: PMC10536991
DOI: 10.1021/acs.jctc.3c00491

Frank Hu et al. J Chem Theory Comput. 2023.

Affiliation

¹ Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.

Abstract

Quantum chemistry provides chemists with invaluable information, but the high computational cost limits the size and type of systems that can be studied. Machine learning (ML) has emerged as a means to dramatically lower the cost while maintaining high accuracy. However, ML models often sacrifice interpretability by using components such as the artificial neural networks of deep learning that function as black boxes. These components impart the flexibility needed to learn from large volumes of data but make it difficult to gain insight into the physical or chemical basis for the predictions. Here, we demonstrate that semiempirical quantum chemical (SEQC) models can learn from large volumes of data without sacrificing interpretability. The SEQC model is that of density-functional-based tight binding (DFTB) with fixed atomic orbital energies and interactions that are one-dimensional functions of the interatomic distance. This model is trained to ab initio data in a manner that is analogous to that used to train deep learning models. Using benchmarks that reflect the accuracy of the training data, we show that the resulting model maintains a physically reasonable functional form while achieving an accuracy, relative to coupled cluster energies with a complete basis set extrapolation (CCSD(T)*/CBS), that is comparable to that of density functional theory (DFT). This suggests that trained SEQC models can achieve a low computational cost and high accuracy without sacrificing interpretability. Use of a physically motivated model form also substantially reduces the amount of ab initio data needed to train the model compared to that required for deep learning models.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

**Figure 1**
Comparison of different quantum chemistry methods on atomization energies (eq 2 and Section 4). The heatmap is generated from the ∼230 k molecular configurations in the ANI-1CCX data set with up to eight heavy atoms, after removing configurations with incomplete entries. The DFTBML DFT/CC parametrizations were trained to wB97x/def2-TZVPP or CCSD(T)*/CBS energies, respectively, on 20 000 molecules with up to eight heavy atoms. Agreement of DFTBML with the method to which it was trained is highlighted in white boxes. DFTBML improves substantially on currently published DFTB parameters (MIO and Auorg), with the agreement between DFTBML CC and CCSD(T)*/CBS being somewhat better than that between DFT (wB97x/def2-TZVPP) and CCSD(T)*/CBS.

**Figure 2**
Distributions of internuclear distances between H–H, C–C, C–H, and N–O in the cleaned ANI-1CCX data set for molecules with up to eight heavy atoms. Repulsive interactions are truncated beyond nearest-neighbor interactions (blue arrows) with a lower bound of 0 Å. Electronic interactions go to a longer range (4.5 Å) with a lower bound slightly lower than the shortest distance in a given distribution. Precise cutoffs for electronic and repulsive splines can be found in Tables S2 and S3, respectively.

**Figure 3**
Overview of the method used to generate data sets. Red arrows indicate random sampling. Molecules are divided based on their empirical formulas, ensuring no mixing between training and testing data.
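
The formula-based split described in this caption can be illustrated with a minimal sketch. It assumes each record is a `(formula, payload)` pair; the function name `split_by_formula`, the `test_fraction` parameter, and the toy records are hypothetical and are not taken from the DFTBML code.

```python
import random
from collections import defaultdict


def split_by_formula(molecules, test_fraction=0.2, seed=0):
    """Split (formula, payload) records so no formula lands in both sets."""
    # Group every configuration under its empirical formula.
    groups = defaultdict(list)
    for formula, payload in molecules:
        groups[formula].append((formula, payload))

    # Randomly choose whole formulas (not single configurations) for testing.
    formulas = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(formulas)
    n_test = max(1, round(test_fraction * len(formulas)))
    test_formulas = set(formulas[:n_test])

    train = [m for f in formulas if f not in test_formulas for m in groups[f]]
    test = [m for f in formulas if f in test_formulas for m in groups[f]]
    return train, test


# Hypothetical toy records; real entries would carry geometries and targets.
molecules = [
    ("CH4", "conf_a"), ("CH4", "conf_b"), ("C2H6", "conf_c"),
    ("H2O", "conf_d"), ("NH3", "conf_e"), ("C2H6", "conf_f"),
]
train, test = split_by_formula(molecules, test_fraction=0.25, seed=1)
```

Sampling at the level of formulas rather than configurations is what keeps conformers of the same molecule from appearing on both sides of the split.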

**Figure 4**
Effects of regularization on (C_2p|N_2p)_σ overlaps (S, top row) and Hamiltonian elements (H₁, bottom row): no regularization (left column), convex penalty that constrains the sign of the second derivative (middle column), and convex plus smoothing that penalizes the magnitude of the third derivative (right column). The Auorg reference functions (orange, dashed lines) are included for comparison to the functions trained on the Transfer CC 2500 data set (blue).
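
The two penalties named in this caption can be sketched on a sampled curve. The sketch below uses finite differences on a uniform grid as a stand-in for spline derivatives, which is an assumption; the function names, the squared-hinge form of the convex penalty, and the test curves are all illustrative and are not the paper's implementation.

```python
import math


def finite_differences(values, spacing, order):
    """Return the order-th divided difference of samples on a uniform grid."""
    diffs = list(values)
    for _ in range(order):
        diffs = [(b - a) / spacing for a, b in zip(diffs, diffs[1:])]
    return diffs


def convex_penalty(values, spacing, sign=1.0):
    """Penalize second derivatives whose sign is opposite to `sign`."""
    second = finite_differences(values, spacing, 2)
    return sum(max(0.0, -sign * d) ** 2 for d in second)


def smoothing_penalty(values, spacing):
    """Penalize the magnitude of the third derivative."""
    third = finite_differences(values, spacing, 3)
    return sum(d * d for d in third)


# Illustrative curves: a smooth convex decay and the same curve with wiggles.
spacing = 0.1
radii = [1.0 + spacing * i for i in range(21)]
smooth = [math.exp(-r) for r in radii]
wiggly = [s + 0.05 * math.sin(20.0 * r) for s, r in zip(smooth, radii)]

smooth_convex = convex_penalty(smooth, spacing)
wiggly_convex = convex_penalty(wiggly, spacing)
smooth_third = smoothing_penalty(smooth, spacing)
wiggly_third = smoothing_penalty(wiggly, spacing)
```

In this toy setting the convex penalty vanishes for the smooth decay and is positive for the oscillating curve, and the smoothing penalty is much larger for the oscillating curve, which mirrors the left-to-right progression of the columns in the figure.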

**Figure 5**
Trade-off between MAE in total energy and dipoles as a function of the dipole weighting factor for DFTBML CC 2500. A weighting factor of 100 (eÅ)⁻¹ was chosen to improve the performance on dipoles while only marginally impacting the performance on total energy. More details on hyperparameter sensitivities can be found in Section S12.
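
How a dipole weighting factor enters a multi-target objective can be shown with a small sketch. It assumes the total loss is a weighted sum of mean absolute errors; the actual DFTBML loss form, the `energy_weight` parameter, and the error values below are assumptions made for illustration only.

```python
def mean_abs(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def weighted_loss(energy_errors, dipole_errors,
                  dipole_weight=100.0, energy_weight=1.0):
    """Weighted sum of energy and dipole errors (assumed loss form)."""
    return (energy_weight * mean_abs(energy_errors)
            + dipole_weight * mean_abs(dipole_errors))


# Hypothetical residuals: energies in kcal/mol, dipoles in e*Angstrom.
energy_errors = [0.5, -1.5]
dipole_errors = [0.01, -0.03]

loss_default = weighted_loss(energy_errors, dipole_errors)
loss_no_dipole = weighted_loss(energy_errors, dipole_errors, dipole_weight=0.0)
```

Because dipole residuals are numerically much smaller than energy residuals in these units, a large weight such as 100 (eÅ)⁻¹ is needed before the dipole term contributes on a comparable scale.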

**Figure 6**
Final training, validation, and testing losses for each of the physical targets as a function of the size of the data set used for training. Results are for training to the CC energy target. Error bars are based on σ, the standard deviation of the errors calculated separately for the training, validation, and testing values.

**Figure 7**
Example splines for Hamiltonian elements (H₁, left) and overlap elements (S, right) generated from DFTBML training on the CC targets of disjoint data sets, each containing 5000 molecules. The Auorg potentials are included for reference.

**Figure 8**
High-level overview of the DFTBML model workflow. Note that model testing uses DFTB+ and is external to model training (lower right).

**Figure 9**
Schematic illustration of inverting the SCF (orange arrows) and training (blue arrows) loops of the DFTBML workflow. In the outer loop, the charge fluctuations needed for the Fock operator are updated based on the current model parameters. The repulsive model is updated on the same schedule as for the charge fluctuations.

Cited by

OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials.
Eastman P, Galvelis R, Peláez RP, Abreu CRA, Farr SE, Gallicchio E, Gorenko A, Henry MM, Hu F, Huang J, Krämer A, Michel J, Mitchell JA, Pande VS, Rodrigues JP, Rodriguez-Guerra J, Simmonett AC, Singh S, Swails J, Turner P, Wang Y, Zhang I, Chodera JD, De Fabritiis G, Markland TE. J Phys Chem B. 2024 Jan 11;128(1):109-116. doi: 10.1021/acs.jpcb.3c06662. Epub 2023 Dec 28. PMID: 38154096. Free PMC article.

Cross-disciplinary perspectives on the potential for artificial intelligence across chemistry.
Mroz AM, Basford AR, Hastedt F, Jayasekera IS, Mosquera-Lois I, Sedgwick R, Ballester PJ, Bocarsly JD, Antonio Del Río Chanona E, Evans ML, Frost JM, Ganose AM, Greenaway RL, Kuok Mimi Hii K, Li Y, Misener R, Walsh A, Zhang D, Jelfs KE. Chem Soc Rev. 2025 Jun 3;54(11):5433-5469. doi: 10.1039/d5cs00146c. PMID: 40278836. Free PMC article. Review.

Efficient Parameterization of Density Functional Tight-Binding for 5f-Elements: A Th-O Case Study.
Liu C, Aguirre NF, Cawkwell MJ, Batista ER, Yang P. J Chem Theory Comput. 2024 Jul 23;20(14):5923-5936. doi: 10.1021/acs.jctc.4c00145. Epub 2024 Jul 11. PMID: 38990696. Free PMC article.

Data Generation for Machine Learning Interatomic Potentials and Beyond.
Kulichenko M, Nebgen B, Lubbers N, Smith JS, Barros K, Allen AEA, Habib A, Shinkle E, Fedik N, Li YW, Messerly RA, Tretiak S. Chem Rev. 2024 Dec 25;124(24):13681-13714. doi: 10.1021/acs.chemrev.4c00572. Epub 2024 Nov 21. PMID: 39572011. Free PMC article. Review.
