Nat Comput Sci. 2024 Mar;4(3):200-209.
doi: 10.1038/s43588-024-00602-x. Epub 2024 Mar 8.

Electron density-based GPT for optimization and suggestion of host-guest binders

Juan M Parrilla-Gutiérrez et al. Nat Comput Sci. 2024 Mar.

Abstract

Here we present a machine learning model trained on electron density for the production of host-guest binders. These are read out in the simplified molecular-input line-entry system (SMILES) format with >98% accuracy, enabling a complete characterization of the molecules in two dimensions. Our model generates three-dimensional representations of the electron density and electrostatic potentials of host-guest systems using a variational autoencoder, and then utilizes these representations to optimize the generation of guests via gradient descent. Finally, the guests are converted to SMILES using a transformer. The successful practical application of our model to established molecular host systems, cucurbit[n]uril and metal-organic cages, resulted in the discovery of 9 previously validated guests for CB[6] and 7 unreported guests (with association constant Ka ranging from 13.5 M-1 to 5,470 M-1), and the discovery of 4 unreported guests for [Pd214]4+ (with Ka ranging from 44 M-1 to 529 M-1).
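
For orientation, the sketch below chains the three stages the abstract describes (VAE decoder, electrostatic-potential network, transformer read-out). Every module here is a randomly initialized stand-in with assumed sizes, not the authors' trained models; the paper voxelizes on 64-voxel grids, and a smaller grid is used here only to keep the toy example light.

```python
# Illustrative chaining of the three stages described in the abstract, not the
# authors' code: a VAE decoder mapping a latent vector to a voxelized electron
# density, a network predicting electrostatic potentials from it, and a
# transformer that reads the volume out as SMILES.
import torch
import torch.nn as nn

GRID, LATENT = 32, 400                                    # assumed sizes for this sketch

decoder = nn.Sequential(nn.Linear(LATENT, GRID ** 3), nn.Sigmoid())  # latent -> density
esp_net = nn.Conv3d(1, 1, kernel_size=3, padding=1)                  # density -> potential

def density_to_smiles(density: torch.Tensor) -> str:
    """Stand-in for the transformer read-out of Fig. 3; returns a fixed string here."""
    return "CCO"

z = torch.randn(1, LATENT)                                # a point in the VAE latent space
density = decoder(z).view(1, 1, GRID, GRID, GRID)         # 3D electron density
potential = esp_net(density)                              # electrostatic potential
smiles = density_to_smiles(density)                       # 2D read-out for synthesis
print(smiles)
```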

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Discovering novel guest molecules through electron density volumetric representation.
a, The QM9 chemical space (with C, O, N and F referring to carbon, oxygen, nitrogen and fluorine, respectively) was used to train our VAE. Once trained, the latent space created by the VAE (a 1D space) could be navigated, and the 3D structural information of a target molecule was reconstructed using the VAE decoder (molecule generator). Given a target host, gradient descent was used to discover guests that maximize the electrostatic interactions with the host while minimizing electron density overlap. The 3D volumes of the candidate guests were translated into SMILES, giving the full chemical information required for their synthesis. b, The potential guest molecules generated by the optimization algorithm for cucurbituril CB[6] and metal–organic cage [Pd214]4+ were selected by an expert chemist for experimental testing based first on their structural resemblance to known guests and second on their commercial availability. The Ka of the guest molecules selected for CB[6] or [Pd214]4+ was quantified by direct 1H NMR titration.
Fig. 2. Sampling the QM9 chemical space using a VAE.
a, Conversion of the QM9 dataset (DB) in XYZ format (XYZ values are shown solely for representation purposes) to electron densities and electrostatic potentials using quantum mechanical methods and density calculators. xTB refers to the Semiempirical Extended Tight-Binding Program Package software; e refers to partial charges on each atom. b, Training a VAE to model the QM9 chemical space. The encoder side of the VAE was used to encode molecules into their 1D latent representations, while the decoder side was used to generate molecules given 1D latent vectors. Molecules were generated into a 3D tensor of 64 units (voxels) per side. µ, σ and z refer to the mean, standard deviation and latent space, respectively. c, Use of an FCN to calculate the electrostatic potential of a molecule given its electron density. tanh → log indicates that each element of the input tensor was passed through a tanh operation followed by a log operation. CNN, convolutional neural network.
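
A possible shape for these components is sketched below: a 3D convolutional VAE over 64-voxel density grids plus the tanh → log input transform of panel c. The channel counts, latent size and layer depths are assumptions for illustration only, not the architecture reported in the paper.

```python
# Sketch of a 3D VAE over 64^3 density voxels and the elementwise tanh -> log
# preprocessing applied before the ESP network. Sizes are assumptions.
import torch
import torch.nn as nn

class DensityVAE(nn.Module):
    def __init__(self, latent_dim=400):
        super().__init__()
        self.encoder = nn.Sequential(                      # 64 -> 8 voxels per side
            nn.Conv3d(1, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(), nn.Flatten())
        self.mu = nn.Linear(32 * 8 ** 3, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(32 * 8 ** 3, latent_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(                      # 1D latent -> 64^3 grid
            nn.Linear(latent_dim, 32 * 8 ** 3), nn.Unflatten(1, (32, 8, 8, 8)),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.decoder(z), mu, logvar

def tanh_log(x, eps=1e-6):
    """Panel c's tanh -> log transform, applied elementwise to the input tensor."""
    return torch.log(torch.tanh(x) + eps)

vae = DensityVAE()
recon, mu, logvar = vae(torch.rand(2, 1, 64, 64, 64))          # toy batch of densities
```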
Fig. 3. Transforming electron densities into SMILES representations using a transformer model followed by optimization of the guests for a target host via gradient descent.
a, Inputs of either decorated or non-decorated electron densities. FF, fully connected feed-forward network; Trans., transformer; Nx refers to the blocks being repeated (stacked) N times. b, Standard implementation of the transformer model to design a molecule embedding layer transforming 3D volumes into 2D tensors later usable in the different attention mechanisms. In the electrostatic potential tensor, areas in red represent positive electrostatic potential while areas in blue represent negative electrostatic potential. c, Examples of different translated electron densities. d, Use of the probabilities output by the last softmax layer to randomly sample one of the tokens, allowing molecules to be found that fit a defined 3D cavity. e, Behavior of the transformer as a GPT model working with SMILES when the encoder is disabled.
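
The stochastic decoding of panel d can be illustrated as follows. The toy vocabulary and the random-logits stand-in for the transformer decoder are assumptions; the point is that tokens are drawn from the softmax distribution rather than taken greedily, so repeated runs can propose different molecules for the same encoded 3D volume.

```python
# Sketch of softmax sampling of SMILES tokens, as in Fig. 3d.
import torch

vocab = ["<bos>", "<eos>", "C", "N", "O", "1", "(", ")", "="]   # toy SMILES tokens

def decoder_logits(prefix_ids, memory):
    """Stand-in for the transformer decoder conditioned on the 3D-volume embedding;
    here it simply returns random logits over the vocabulary."""
    return torch.randn(len(vocab))

def sample_smiles(memory, max_len=40, temperature=1.0):
    ids = [vocab.index("<bos>")]
    for _ in range(max_len):
        probs = torch.softmax(decoder_logits(ids, memory) / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()   # random token choice
        if vocab[next_id] == "<eos>":
            break
        ids.append(next_id)
    return "".join(vocab[i] for i in ids[1:])

memory = torch.randn(64, 128)        # placeholder embedding of the guest electron density
print(sample_smiles(memory))         # a different call can yield a different candidate
```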
Fig. 4. Optimizing guests for a target host via gradient descent.
a, Targeting of multiple fitness functions for optimizing host–guest interactions: maximize the size of the guest, minimize its overlap with the host and maximize its electrostatic interactions. In the right panel, areas in red represent positive electrostatic potential while areas in blue represent negative electrostatic potential. b, Initial population of guests generated through random sampling. Using random sampling, a 1D vector in the latent space was generated. Via the VAE, a 3D electron density could be reconstructed from this 1D vector. From this 3D electron density, and using the FCN, its electrostatic potentials were calculated.
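
A hedged sketch of this gradient-descent step is given below, combining the three fitness terms of panel a over a randomly sampled initial population as in panel b. The decoder, ESP network, host volumes and equal term weights are placeholders with assumed sizes, not the paper's trained components.

```python
# Sketch of the gradient-descent loop suggested by Fig. 4.
import torch
import torch.nn as nn

GRID, LATENT, POP = 32, 400, 8                    # toy sizes; the paper uses 64-voxel grids
decoder = nn.Sequential(nn.Linear(LATENT, GRID ** 3), nn.Sigmoid())  # stand-in VAE decoder
esp_net = nn.Conv3d(1, 1, kernel_size=3, padding=1)                  # stand-in FCN for ESP

host_density = torch.rand(1, 1, GRID, GRID, GRID)      # placeholder host electron density
host_esp = torch.randn(1, 1, GRID, GRID, GRID)         # placeholder host electrostatic potential

z = torch.randn(POP, LATENT, requires_grad=True)       # random initial population (panel b)
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    guest_density = decoder(z).view(POP, 1, GRID, GRID, GRID)
    guest_esp = esp_net(guest_density)
    size = guest_density.sum(dim=(1, 2, 3, 4))                        # maximize guest size
    overlap = (guest_density * host_density).sum(dim=(1, 2, 3, 4))    # minimize steric clash
    coulomb = (guest_esp * host_esp).sum(dim=(1, 2, 3, 4))            # negative when complementary
    loss = (-size + overlap + coulomb).mean()                         # equal weights assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Each optimized latent vector is then decoded and passed to the transformer for SMILES read-out.
```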
Fig. 5. Optimization pipeline and generation of SMILES representations of the guests.
a, Optimization pipeline to maximize guest size. b, Optimization pipeline simultaneously minimizing host–guest electron density overlap while maximizing their electrostatic interactions. ED, electron density; ESP, electrostatic potential. In the electrostatic potential tensor, areas in red represent positive electrostatic potential while areas in blue represent negative electrostatic potential. c, Use of our transformer model to obtain the SMILES representation of the generated guest.
Fig. 6. Optimized and previously known guests for CB[6] and optimized guests for [Pd214]4+.
a, Structures and log Ka values for guest molecules generated by the optimization algorithm for CB[6], and the structure of CB[6]. Association constants were measured in HCO2H/H2O 1:1 v/v. The association constants between CB[6] and guests 1 to 9 (G1–G9) in HCO2H/H2O 1:1 v/v were previously reported in the literature. b, Left: structures and log Ka values for guest molecules previously reported in the literature for [Pd214](BArF)4; association constants were measured in CD2Cl2 (ref. ; these four guests were not generated by our model). Middle: the structure of [Pd214]4+. Right: structures and log Ka values for guest molecules generated by the optimization algorithm for [Pd214](BArF)4. Association constants were measured in CD2Cl2.

References

    1. Polishchuk PG, Madzhidov TI, Varnek A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des. 2013;27:675–679. doi: 10.1007/s10822-013-9672-4. - DOI - PubMed
    2. Vanhaelen Q, Lin YC, Zhavoronkov A. The advent of generative chemistry. ACS Med. Chem. Lett. 2020;11:1496–1505. doi: 10.1021/acsmedchemlett.0c00088. - DOI - PMC - PubMed
    3. Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL. Quantifying the chemical beauty of drugs. Nat. Chem. 2012;4:90–98. doi: 10.1038/nchem.1243. - DOI - PMC - PubMed
    4. Polykovskiy D, et al. Molecular Sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 2020;11:565644. doi: 10.3389/fphar.2020.565644. - DOI - PMC - PubMed
    5. Atz K, Grisoni F, Schneider G. Geometric deep learning on molecular representations. Nat. Mach. Intell. 2021;3:1023–1032. doi: 10.1038/s42256-021-00418-8. - DOI