J Chem Theory Comput. 2024 Mar 12;20(5):2152-2166.

doi: 10.1021/acs.jctc.3c01256. Epub 2024 Feb 8.

Highly Accurate Prediction of NMR Chemical Shifts from Low-Level Quantum Mechanics Calculations Using Machine Learning

Jie Li¹, Jiashu Liang¹, Zhe Wang¹, Aleksandra L Ptaszek²,³, Xiao Liu¹, Brad Ganoe¹, Martin Head-Gordon¹,⁴, Teresa Head-Gordon¹,⁴,⁵

Affiliations

¹ Pitzer Center for Theoretical Chemistry, Department of Chemistry, University of California, Berkeley, California 94720, United States.
² Christian Doppler Laboratory for High-Content Structural Biology and Biotechnology, Department of Structural and Computational Biology, Max Perutz Laboratories, University of Vienna, Campus Vienna Biocenter 5, Vienna 1030, Austria.
³ Laboratory for Computer-Aided Molecular Design, Division of Medicinal Chemistry, Otto Loewi Research Center, Medical University Graz, Neue Stiftingtalstrasse 6/III, Graz 8010, Austria.
⁴ Chemical Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States.
⁵ Departments of Bioengineering and Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, California 94720, United States.

PMID: 38331423
PMCID: PMC11702896
DOI: 10.1021/acs.jctc.3c01256

Jie Li et al. J Chem Theory Comput. 2024.

Abstract

Theoretical predictions of NMR chemical shifts from first principles can greatly facilitate experimental interpretation and structure identification of molecules in gas, solution, and solid-state phases. However, accurate prediction of chemical shifts using the gold-standard coupled cluster with singles, doubles, and perturbative triple excitations [CCSD(T)] method with a complete basis set (CBS) can be prohibitively expensive. By contrast, machine learning (ML) methods offer inexpensive alternatives for chemical shift predictions but are hampered by poor generalization to molecules outside the original training set. Here, we propose several new ideas in ML-based chemical shift prediction for H, C, N, and O. First, we introduce a novel feature representation, based on the atomic chemical shielding tensors within a molecular environment computed using an inexpensive quantum mechanics (QM) method, and train the model to predict NMR chemical shieldings of a high-level composite theory that approaches the accuracy of CCSD(T)/CBS. In addition, we train the ML model through a new progressive active learning workflow that reduces the total number of expensive high-level composite calculations required while allowing the model to continuously improve on unseen data. Furthermore, the algorithm provides an error estimate, signaling potential unreliability in predictions if the error is large. Finally, we introduce a novel approach to maintain the rotational invariance of the features using tensor environment vectors (TEVs), which yields an ML model with higher accuracy than a similar model using data augmentation. We illustrate the predictive capacity of the resulting inexpensive shift machine learning (iShiftML) models across several benchmarks, including unseen molecules in the NS372 data set, gas-phase experimental chemical shifts for small organic molecules, and much larger and more complex natural products in which we can accurately differentiate between subtle diastereomers based on chemical shift assignments.

Conflict of interest statement

DECLARATION OF INTERESTS

M.H.G. is a part-owner of Q-Chem Inc, whose software was used for many of the calculations reported here.

Figures

**Figure 1. The iShiftML ensemble learning model that uses low-level QM calculations of the shielding tensor and AEVs to predict high-level chemical shieldings.**
Given a molecular geometry, the AEV around each nucleus is prepared and passed to an MLP with two layers of 128 neurons each, with the ReLU activation function used for the first layer, to encode the AEVs into a 128-dimensional internal representation. On the second branch, we perform low-level composite QM calculations to obtain the 18 DIA and PARA chemical shielding values, which are concatenated with the AEVs from the first branch to provide input for the second MLP weight network. The weight MLP is composed of a first layer containing 64 neurons with ReLU activation, followed by a second layer of 19 neurons (including a bias term) without an activation function.
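
The data flow in this caption can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the AEV length (`AEV_DIM`), the random parameters, and the helper names `make_layer`, `dense`, `relu`, and `ishiftml_forward` are hypothetical. It also assumes that the encoded 128-dimensional representation is what gets concatenated with the 18 tensor elements, and that the 19 outputs act as 18 linear weights on those elements plus one additive bias, which the caption implies but does not spell out.

```python
import random

random.seed(0)
AEV_DIM = 8  # hypothetical; the real AEV length is not given in the caption


def make_layer(n_in, n_out):
    """Random (weights, biases) for a dense layer; stand-in for trained parameters."""
    w = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def dense(layer, x):
    """Affine map: one output per weight row."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


# Branch 1: AEV encoder, two layers of 128 neurons, ReLU on the first layer.
enc1 = make_layer(AEV_DIM, 128)
enc2 = make_layer(128, 128)
# Weight network: 64 neurons with ReLU, then 19 linear outputs.
wn1 = make_layer(128 + 18, 64)
wn2 = make_layer(64, 19)


def ishiftml_forward(aev, tensor_elems):
    """aev: AEV_DIM floats; tensor_elems: the 18 DIA and PARA elements."""
    encoded = dense(enc2, relu(dense(enc1, aev)))
    hidden = relu(dense(wn1, encoded + tensor_elems))  # list concatenation
    out = dense(wn2, hidden)
    weights, bias = out[:18], out[18]
    # Assumed combination step: weighted sum of low-level elements plus bias.
    shielding = sum(w * x for w, x in zip(weights, tensor_elems)) + bias
    return shielding, weights, bias
```

Under this reading, the network predicts a correction to how the low-level tensor elements are combined, so the low-level QM values remain the dominant input to the final shielding.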

**Figure 2. Distributions of weights of the original model, without considering rotational invariance, for hydrogen atoms evaluated on test data.**
Distributions of the weights for diagonal elements in the DIA and PARA matrices are centered close to 1/3, off-diagonal elements are centered around 0, and the bias term is distributed around −0.17.
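
The centers reported here have a simple reading: weights of 1/3 on the diagonal elements and 0 on the off-diagonal elements reproduce the low-level isotropic shielding (one third of the trace of DIA plus one third of the trace of PARA), with the bias acting as a constant correction. The sketch below checks that arithmetic. The tensor values are made up, and the row-major flattening order and the linear combination of the 18 elements are assumptions.

```python
def isotropic(tensor):
    """One third of the trace of a 3x3 tensor given as nested lists."""
    return (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3.0


def flatten(tensor):
    """Row-major flattening of a 3x3 tensor (assumed element order)."""
    return [tensor[i][j] for i in range(3) for j in range(3)]


# Made-up DIA and PARA shielding tensors for one hydrogen atom.
dia = [[30.0, 1.0, -0.5], [0.8, 28.0, 0.2], [-0.4, 0.3, 32.0]]
para = [[-2.0, 0.1, 0.0], [0.2, -1.0, 0.3], [0.0, -0.1, -3.0]]

# Weights placed at the centers of the reported distributions.
diag_idx = {0, 4, 8}  # positions of xx, yy, zz in a row-major 3x3 tensor
w9 = [1.0 / 3.0 if k in diag_idx else 0.0 for k in range(9)]
weights = w9 + w9
bias = -0.17

elems = flatten(dia) + flatten(para)
prediction = sum(w * x for w, x in zip(weights, elems)) + bias
baseline = isotropic(dia) + isotropic(para)
```

Here `prediction` equals `baseline + bias` up to floating-point rounding, so deviations of the learned weights from these centers are what carry the correction beyond the low-level isotropic value.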

**Figure 3. Ensemble prediction and correlation with actual prediction error.**
(a) An ensemble learning approach using 5-fold cross-validation to train individual models in the ensemble. The final prediction is the average prediction from the models after excluding outliers recognized by the Local Outlier Factor algorithm. (b) An undertrained model for oxygen tested on the 8-heavy-atom test set, showing the correlation between predicted and actual values. Data points are colored according to their standard deviation (STD), with warm colors representing high STDs and cool colors representing low STDs. (c) Prediction errors compared to reference values are found to be well correlated with standard deviations of the predictions in the ensemble on a log-log plot. See Methods for further details.
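
The aggregation in panel (a) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the caption names the Local Outlier Factor algorithm, whereas this dependency-free sketch swaps in a median-absolute-deviation (MAD) filter. The function name `ensemble_predict`, the cutoff value, and the choice to compute the standard deviation over the retained members are all hypothetical.

```python
import statistics


def ensemble_predict(member_preds, cutoff=3.5):
    """Aggregate per-model predictions for one atom.

    The outlier rule is a MAD-based modified z-score filter, used here as a
    stand-in for the Local Outlier Factor step named in the caption.
    Returns (mean, std) over the retained members.
    """
    med = statistics.median(member_preds)
    mad = statistics.median(abs(p - med) for p in member_preds)
    if mad == 0.0:
        kept = list(member_preds)  # no spread to judge outliers against
    else:
        kept = [p for p in member_preds if 0.6745 * abs(p - med) / mad <= cutoff]
    return statistics.fmean(kept), statistics.pstdev(kept)


# Five made-up predictions (ppm), one per cross-validation fold.
mean, std = ensemble_predict([31.2, 31.4, 31.3, 31.1, 45.0])
```

In this toy input the fifth member is discarded before averaging, and the returned standard deviation plays the role of the uncertainty signal that panel (c) correlates with the actual error.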

**Figure 4. Procedure and results of the active learning workflow.**
(a) The active learning (AL) workflow. Starting from a model trained with data up to 4 heavy atoms (HA), data with 5HA are evaluated using the trained model, and the 1500 structures with the largest predicted standard deviations from the 5HA dataset are added to define the training set for the next iteration. This is repeated until the training set contains molecules up to 7HA. The 8HA dataset is always used as the test set. (b-e) RMSE on the 8HA test set for models trained with AL on training sets containing molecules of different sizes (blue curve), together with a baseline model trained using linear regression (green dotted line). Panels show hydrogens (b), carbons (c), nitrogens (d), and oxygens (e). The data in (b-e) are also provided in tabular form in Supplementary Table S2. Note that the RMSEs are calculated with uncertain predictions excluded, which removes any prediction with an ensemble standard deviation larger than 0.5 ppm for H, 2.5 ppm for C, 5 ppm for N, or 10 ppm for O.
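
Two steps in this caption are easy to make concrete: picking the most uncertain structures for the next round, and computing an RMSE that skips uncertain predictions. The sketch below is hypothetical in its function names and data layout, and it assumes each structure already has a single uncertainty score (the caption does not say how per-atom standard deviations are aggregated per structure). Only the selection size of 1500 and the per-element cutoffs come from the caption.

```python
import math

# Ensemble standard deviation cutoffs (ppm) quoted in the caption.
STD_CUTOFF_PPM = {"H": 0.5, "C": 2.5, "N": 5.0, "O": 10.0}


def select_for_labeling(candidates, n_select=1500):
    """candidates: list of (structure_id, predicted_std).

    Returns the ids of the n_select structures with the largest
    predicted standard deviation, most uncertain first.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [sid for sid, _ in ranked[:n_select]]


def filtered_rmse(records, element):
    """records: list of (prediction, reference, ensemble_std) for one element.

    Predictions whose ensemble std exceeds the element's cutoff are excluded.
    """
    cutoff = STD_CUTOFF_PPM[element]
    errs = [p - r for p, r, s in records if s <= cutoff]
    if not errs:
        return float("nan")
    return math.sqrt(sum(e * e for e in errs) / len(errs))
```

For example, `select_for_labeling([("a", 0.1), ("b", 0.9), ("c", 0.5)], n_select=2)` keeps `"b"` and `"c"`, which would then be sent for the expensive high-level calculation and added to the training set.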

**Figure 5. Architecture of the TEV variant of the iShiftML model.**
As an alternative to the model in Figure 1, in the second branch the DIA and PARA chemical shielding tensors are embedded into a 98-dimensional TEV that is concatenated with the AEVs from the first branch to provide input for the second MLP weight network. The weight MLP is composed of a first layer containing 64 neurons with ReLU activation, followed by a second layer of 3 neurons (including a bias term) without an activation function. In total, the TEV of one atom comprises 98 elements, including 16 reference magnitude indices each for isotropic diamagnetic and paramagnetic values, 32 indices for the final isotropic value, and 16 direction elements each for the diamagnetic and paramagnetic tensors. This combination captures both magnitude and direction while maintaining rotational invariance.
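
The caption does not detail how a TEV is built, so the sketch below does not implement one. It only illustrates the problem that rotationally invariant features address: when the molecule is rotated, a shielding tensor T becomes R T Rᵀ, so its individual elements change while its isotropic value (one third of the trace) does not. The tensor values, the rotation angle, and the helper names are made up for illustration.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rotate_tensor(tensor, angle):
    """Return R T R^T for a rotation by `angle` (radians) about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    r = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return matmul(matmul(r, tensor), transpose(r))


def isotropic(tensor):
    """One third of the trace, which is unchanged by any rotation."""
    return (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3.0


shielding = [[30.0, 1.0, -0.5], [0.8, 28.0, 0.2], [-0.4, 0.3, 32.0]]
rotated = rotate_tensor(shielding, math.radians(40.0))

xx_change = abs(rotated[0][0] - shielding[0][0])        # frame dependent
iso_change = abs(isotropic(rotated) - isotropic(shielding))  # ~0
```

A model fed the raw tensor elements therefore sees different inputs for the same molecule in different orientations, which is why the alternatives are either data augmentation over rotations or an invariant embedding such as the TEV.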

**Figure 6. Predicting experimental gas-phase chemical shifts for small organic molecules.**
(a) The small molecules under investigation. 3D geometries of these molecules are taken from the NS372 data set and the NIST database. (b-d) Distributions of errors between predicted and experimental gas-phase NMR chemical shifts for low-level DFT calculations (ωB97X-V/pcSseg-1, blue distributions) and iShiftML predictions for the high-level CCSD(T) composite method (orange distributions) for hydrogens (b), carbons (c), and nitrogens (d). Also see Figure S3 and Table S6.

**Figure 7. Results on predicting and comparing CS for strychnine.**
(a) Molecular structure of strychnine. (b) Absolute prediction error for the low-level DFT method and iShiftML across the experimental CS range for hydrogens. (c) Absolute prediction error for the low-level DFT method and iShiftML across the experimental CS range for carbons. All predicted CS are re-referenced to have the same mean values as experimental measurements.
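
The re-referencing mentioned at the end of this caption amounts to adding one constant to all predictions so that their mean matches the experimental mean. A minimal sketch follows, assuming a single uniform offset per set of shifts being compared. The function names and the numbers in the usage lines are hypothetical.

```python
def re_reference(predicted, experimental):
    """Shift predictions by a constant so their mean equals the experimental mean."""
    offset = sum(experimental) / len(experimental) - sum(predicted) / len(predicted)
    return [p + offset for p in predicted]


def abs_errors(predicted, experimental):
    """Per-nucleus absolute error after pairing predictions with measurements."""
    return [abs(p - e) for p, e in zip(predicted, experimental)]


# Made-up carbon shifts (ppm) for three nuclei.
pred = [20.0, 60.0, 130.0]
expt = [22.5, 61.0, 134.5]
shifted = re_reference(pred, expt)
errors = abs_errors(shifted, expt)
```

Because the offset is constant, re-referencing removes any systematic shift between calculation and experiment but leaves the relative spacing of the predicted shifts untouched.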

**Figure 8. Predictive analysis and comparison of chemical shifts for the 8 diastereomers of vannusal B.**
(a) Molecular structures of the 8 diastereomers of vannusal B. Reproduced from reference [82]. Copyright 2011 American Chemical Society. (b) The prediction RMSE margins for various vannusal B isomers. The bottom of each bar represents the comparison with the true experimental CS, while the top represents the comparison with the CS of vannusal B in its native form, 5-2. Predictions were made using iShiftML, low-level DFT, and M06/pcS-2 (the latter from Ref. 82). Large bars with a low bottom therefore indicate good discrimination between the predicted CS for the true structure and potential misidentification with the CS of the native structure. All predicted CS are re-referenced to have the same mean values as experimental measurements. Also see Figure S5.

Cited by

UCBShift 2.0: Bridging the Gap from Backbone to Side Chain Protein Chemical Shift Prediction for Protein Structures.
Ptaszek AL, Li J, Konrat R, Platzer G, Head-Gordon T. J Am Chem Soc. 2024 Nov 20;146(46):31733-31745. doi: 10.1021/jacs.4c10474. Epub 2024 Nov 12. PMID: 39531038. Free PMC article.
Accurate Prediction of NMR Chemical Shifts: Integrating DFT Calculations with Three-Dimensional Graph Neural Networks.
Han C, Zhang D, Xia S, Zhang Y. J Chem Theory Comput. 2024 Jun 25;20(12):5250-5258. doi: 10.1021/acs.jctc.4c00422. Epub 2024 Jun 6. PMID: 38842505. Free PMC article.
Accurate and Efficient Structure Elucidation from Routine One-Dimensional NMR Spectra Using Multitask Machine Learning.
Hu F, Chen MS, Rotskoff GM, Kanan MW, Markland TE. ACS Cent Sci. 2024 Nov 13;10(11):2162-2170. doi: 10.1021/acscentsci.4c01132. eCollection 2024 Nov 27. PMID: 39634219. Free PMC article.
The interplay of density functional selection and crystal structure for accurate NMR chemical shift predictions.
Ramos SA, Mueller LJ, Beran GJO. Faraday Discuss. 2025 Jan 8;255(0):119-142. doi: 10.1039/d4fd00072b. PMID: 39258864.
Bent naphthodithiophenes: synthesis and characterization of isomeric fluorophores.
Adusei EBA, Casetti VT, Goldsmith CD, Caswell M, Alinj D, Park J, Zeller M, Rusakov AA, Kinney ZJ. RSC Adv. 2024 Aug 12;14(35):25120-25129. doi: 10.1039/d4ra04850d. eCollection 2024 Aug 12. PMID: 39139244. Free PMC article.

References

1. Jacobsen NE. NMR data interpretation explained: understanding 1D and 2D NMR spectra of organic compounds and natural products; John Wiley & Sons, 2016.
1. Hore PJ. Nuclear magnetic resonance; Oxford University Press, USA, 2015.
1. Derome AE. Modern NMR techniques for chemistry research; Elsevier, 2013.
1. Saielli G; Nicolaou K; Ortiz A; Zhang H; Bagno A. Addressing the stereochemistry of complex organic molecules by density functional theory-NMR: Vannusal B in retrospective. J. Am. Chem. Soc. 2011, 133, 6072–6077.
1. Wüthrich K. Protein structure determination in solution by NMR spectroscopy. J. Biol. Chem. 1990, 265, 22059–22062.

Grants and funding

U01 GM121667/GM/NIGMS NIH HHS/United States

