Sci Data. 2022 Jun 7;9(1):273.
doi: 10.1038/s41597-022-01390-7.

QMugs, quantum mechanical properties of drug-like molecules

Clemens Isert et al. Sci Data.

Abstract

Machine learning approaches in drug discovery, as well as in other areas of the chemical sciences, benefit from curated datasets of physical molecular properties. However, there currently is a lack of data collections featuring large bioactive molecules alongside first-principle quantum chemical information. The open-access QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset fills this void. The QMugs collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling ~2 M conformers. QMugs contains optimized molecular geometries and thermodynamic data obtained via the semi-empirical method GFN2-xTB. Atomic and molecular properties are provided on both the GFN2-xTB and on the density-functional levels of theory (DFT, ωB97X-D/def2-SVP). QMugs features molecules of significantly larger size than previously-reported collections and comprises their respective quantum mechanical wave functions, including DFT density and orbital matrices. This dataset is intended to facilitate the development of models that learn from molecular data on different levels of theory while also providing insight into the corresponding relationships between molecular structure and biological activity.

Conflict of interest statement

G.S. is a cofounder of inSili.com LLC, Zurich, and a consultant to the pharmaceutical industry.

Figures

Fig. 1
(a) Principal-moments-of-inertia plot for molecules in the QMugs dataset. NPRx = x-th normalized principal moment, Ix = x-th smallest principal moment of inertia. (b) Venn diagram showing overlap between QMugs and other well-known datasets with DFT-level computed properties: QM9, PubChemQC, and ANI-1. Overlap was computed based on the uniqueness of the InChI representations of the contained molecules. Numbers do not add up to those reported in Table 1 because of InChI strings that occur multiple times.
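The normalized ratios in panel (a) can be illustrated with a minimal sketch. It assumes the common convention NPR1 = I1/I3 and NPR2 = I2/I3, which the caption does not spell out, and it is not the authors' code. The function name `normalized_pmi` and the toy point-mass geometries are hypothetical.

```python
import numpy as np

def normalized_pmi(coords, masses):
    """Return (NPR1, NPR2) = (I1/I3, I2/I3), with I1 <= I2 <= I3.

    Assumption: normalization by the largest principal moment I3.
    """
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    # Shift coordinates so the centre of mass sits at the origin.
    com = (masses[:, None] * coords).sum(axis=0) / masses.sum()
    r = coords - com
    # Inertia tensor: sum_i m_i * (|r_i|^2 * Identity - r_i r_i^T).
    sq_norm = (r ** 2).sum(axis=1)
    outer = masses[:, None, None] * r[:, :, None] * r[:, None, :]
    inertia = (masses * sq_norm).sum() * np.eye(3) - outer.sum(axis=0)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    i1, i2, i3 = np.linalg.eigvalsh(inertia)
    return i1 / i3, i2 / i3

# Hypothetical toy shapes built from unit point masses.
rod = normalized_pmi([[-1, 0, 0], [1, 0, 0]], [1, 1])
disc = normalized_pmi([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]], [1] * 4)
sphere = normalized_pmi(
    [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]],
    [1] * 6,
)
```

Under this convention the three toy shapes land on the corners of the triangular plot: the rod near (0, 1), the disc near (0.5, 0.5), and the sphere-like arrangement near (1, 1).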
Fig. 2
Distribution of properties for the molecules contained in the QMugs dataset.
Fig. 3
Overview of the data generation process. Molecules were extracted from the ChEMBL database, standardized, and filtered, and starting conformers were generated using the RDKit software package. Metadynamics (MTD) simulations were performed using the GFN2-xTB semi-empirical method to generate three diverse conformations before final geometry optimization. Molecules that did not pass a series of geometric sanity checks were removed. DFT-level properties (ωB97X-D/def2-SVP) were computed using Psi4 software.
Fig. 4
(a) Distributions of mean pairwise RMSD of atom positions between conformations of each molecule in the QMugs dataset at different stages along the pipeline. While the k-means sampling process selects conformations that are, on average, more geometrically diverse than the average pair of structures generated by MTD simulations, geometry optimization reduces the geometrical diversity between the optimized conformers. (b) Change in atom positions during geometry optimization vs. mean pairwise RMSD of conformations before optimization. Molecules with initially more diverse conformations displayed a greater change in atom positions than those with initially less diverse conformations. (c) Distribution of RMSD of structures prior to and after optimization with the semi-empirical GFN2-xTB method, and of structures optimized with the same approach vs. with ωB97X-D/def2-SVP. The structures of three molecules with varying differences between the two methods are shown as illustrative examples (black and gray correspond to GFN2-xTB and ωB97X-D/def2-SVP-optimized structures, respectively). For illustrative purposes, the example molecules are aligned on their substructures.
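As a rough illustration of the "mean pairwise RMSD" quantity in panel (a), the sketch below averages the RMSD over all pairs of conformers of one molecule. It assumes the conformers share the same atom ordering and are already superimposed; whether and how the authors aligned structures is not stated in this caption. The function names and coordinates are hypothetical.

```python
import itertools
import math

def rmsd(conf_a, conf_b):
    """RMSD between two conformers given as lists of (x, y, z) tuples.

    Assumption: identical atom ordering, no alignment performed here.
    """
    sq_sum = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(conf_a, conf_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(sq_sum / len(conf_a))

def mean_pairwise_rmsd(conformers):
    """Average RMSD over all unordered pairs of conformers."""
    pairs = list(itertools.combinations(conformers, 2))
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: three "conformers" of a single-atom system.
confs = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(0.0, 4.0, 0.0)]]
value = mean_pairwise_rmsd(confs)  # pair distances are 3, 4 and 5
```

For the three conformers per molecule described in the pipeline, this averages over three pairs.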
Fig. 5
Comparison of molecular properties computed at the two levels of theory considered herein (GFN2-xTB, ωB97X-D/def2-SVP) for the molecules contained in QMugs. The molecular formation energy EForm in (a) was calculated by subtracting the atomic contributions UAtom from the total molecular energies URT. Only the rotational constants A are shown in (c), as their B and C counterparts showed highly similar values. 22 conformations of small molecules show very large rotational constants and are not shown. If those structures are included, the RMSE and PCC for rotational constant A are 845.834 cm−1 and 0.091, respectively. Abbreviations: RMSE, root mean squared error; PCC, Pearson’s correlation coefficient.
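The formation-energy definition and the two agreement metrics named in this caption can be sketched as follows. This is an illustrative example, not the authors' code: the function names, the per-element energy table, and all numeric values are hypothetical placeholders with no physical meaning.

```python
import math

def formation_energy(u_rt, elements, u_atom):
    """EForm = URT minus the sum of per-atom contributions UAtom."""
    return u_rt - sum(u_atom[el] for el in elements)

def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def pearson(x, y):
    """Pearson's correlation coefficient (PCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-element atomic energies and total energy (arbitrary units).
u_atom = {"C": -5.0, "H": -1.0}
e_form = formation_energy(-10.0, ["C", "H", "H"], u_atom)

# Hypothetical paired property values from two levels of theory.
level_a = [1.0, 2.0, 3.0]
level_b = [2.0, 4.0, 6.0]
err = rmse(level_a, level_b)
pcc = pearson(level_a, level_b)
```

The toy data also shows why the caption reports both metrics: two sets of values can correlate perfectly (PCC of 1) and still differ by a sizeable RMSE when one is a scaled version of the other.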
Fig. 6
Atom-type-specific partial charge correlations (GFN2-xTB, ωB97X-D/def2-SVP) for the QMugs dataset (see Table S1 in the Supporting Information for additional metrics).
Fig. 7
Comparison of Wiberg bond orders between GFN2-xTB and ωB97X-D/def2-SVP for the 15 most frequently occurring bond types in the QMugs dataset. The latter level of theory uses Löwdin-orthogonalization. See Table S2 in the Supporting Information for additional metrics. For bond types which occurred > 1 M times in the dataset, a randomly chosen sample of 1 M bonds is plotted.
