Commun Chem. 2023 Nov 16;6(1):249.
doi: 10.1038/s42004-023-01054-6.

Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity

Toshiki Ochiai et al. Commun Chem. 2023.

Abstract

The structural diversity of chemical libraries, which are systematic collections of compounds that have the potential to bind to biomolecules, can be represented by a chemical latent space. A chemical latent space is a projection of compound structures into a mathematical space based on several molecular features. It can express the structural diversity within a compound library, which makes it possible to explore a broader chemical space and generate novel compound structures as drug candidates. In this study, we developed a deep-learning method called NP-VAE (Natural Product-oriented Variational Autoencoder), based on a variational autoencoder, for handling hard-to-analyze datasets from DrugBank and large molecular structures such as natural compounds with chirality, an essential factor in the 3D complexity of compounds. NP-VAE successfully constructed a chemical latent space from large compounds that existing methods could not handle, achieved higher reconstruction accuracy, and performed stably as a generative model across various indices. Furthermore, by exploring the acquired latent space, we comprehensively analyzed a compound library containing natural compounds and generated novel compound structures with optimized functions.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Visualization of chemical latent spaces using t-SNE plot.
a, b Points are colored by NP-likeness score, from purple (low) to yellow (high). Compared with the chemical latent space trained only on substructures (a), the chemical latent space trained on both substructures and the NP-likeness score as functional information (b) shows a distribution that is more clearly clustered by NP-likeness score. c, d Representative anticancer compounds plotted in the chemical latent space trained only on substructures (c) and in the chemical latent space trained on both substructures and NP-likeness scores (d). Both show distributions clustered by anticancer drug classification, with chemotherapeutic drugs and molecular targeted drugs distributed separately. The molecular targeted drugs (red frame) are more locally clustered when the NP-likeness score is included.
Fig. 2. Yessotoxin and its EGFR inhibitory activity.
a Structure of Yessotoxin. Yessotoxin was first discovered in the 1980s in the scallop species Patinopecten yessoensis, and various derivatives have since been found in crustaceans and algae. b Inhibition of EGF-stimulated EGFR phosphorylation by Yessotoxin. EGFR tyrosine kinase activities are expressed as a percentage of the maximal phosphorylation induced by EGF. AG1478, a selective EGFR inhibitor, was used as a positive control. At 100 μg/ml, Yessotoxin showed an inhibitory effect of over 80%.
Fig. 3. Generation of compound structures between two existing drugs through interpolation.
Novel compound structures obtained by scanning the chemical latent space between two points. The starting compound is a nicotinamide adenine dinucleotide derivative from a biomolecule, and the destination compound is Sorafenib, a molecular targeted drug. The three values below each compound structure represent, from left to right, the similarity to the starting compound, the similarity to the destination compound, and the NP-likeness score. As the generated compounds move closer to the destination point, the similarity to the starting compound gradually decreases, the similarity to the destination compound increases, and the NP-likeness score becomes lower.
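The scan between two latent points can be sketched as follows. This is a minimal illustration that assumes straight-line (linear) interpolation, which the caption does not specify; the function name `interpolate_latent` and the 3-dimensional toy vectors are hypothetical, and decoding each point back to a compound structure is left out.

```python
from typing import List


def interpolate_latent(z_start: List[float], z_end: List[float], steps: int) -> List[List[float]]:
    """Return `steps` evenly spaced points on the line from z_start to z_end, endpoints included."""
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the starting compound, 1.0 at the destination
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Toy latent vectors, made up for illustration (not taken from the paper).
z_start = [0.0, 2.0, -1.0]
z_end = [4.0, 0.0, 1.0]
path = interpolate_latent(z_start, z_end, steps=5)
```

In a full pipeline, each point in `path` would be passed to the decoder, and the similarity and NP-likeness values would be computed on the decoded structures.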
Fig. 4. Generation of novel compound structures using Bayesian optimization.
The objective function to be maximized was the quantitative estimate of drug-likeness (QED), and novel compound structures with improved functional indices were explored using Bayesian optimization. The two values below each compound structure represent, from left to right, the QED score and the similarity to the starting compound. The search space was limited to the vicinity of the target compound, and optimization was performed with both narrow and wide search ranges to examine how the search space affects the resulting compound structures. With a narrow search range, novel compound structures with improved QED were obtained while the characteristic structure of the target compound was maintained. With a wider search range, the characteristic structure changed, and novel compound structures with significantly improved QED were obtained.
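The effect of the search range can be illustrated with a small sketch. Note the substitutions: plain random search stands in for Bayesian optimization, and `toy_score` is a hypothetical stand-in for "decode the latent point, then compute QED". The box-shaped search region and all names are assumptions for illustration only.

```python
import random
from typing import Callable, List, Tuple


def search_near(center: List[float], radius: float,
                score: Callable[[List[float]], float],
                n_samples: int = 200, seed: int = 0) -> Tuple[List[float], float]:
    """Random search inside the box center +/- radius; returns the best point and its score."""
    rng = random.Random(seed)
    best_z, best_score = list(center), score(center)
    for _ in range(n_samples):
        z = [c + rng.uniform(-radius, radius) for c in center]
        s = score(z)
        if s > best_score:  # keep the highest-scoring candidate seen so far
            best_z, best_score = z, s
    return best_z, best_score


def toy_score(z: List[float]) -> float:
    """Hypothetical objective with its maximum at (1, 1); not a real QED calculation."""
    return -sum((x - 1.0) ** 2 for x in z)


target = [0.0, 0.0]  # latent point of the target compound (made up)
narrow_z, narrow_score = search_near(target, radius=0.2, score=toy_score)
wide_z, wide_score = search_near(target, radius=1.5, score=toy_score)
```

The narrow search stays close to `target`, which mirrors keeping the characteristic structure, while the wide search can reach higher-scoring points farther away.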
Fig. 5. Generating novel compound structures from the vicinity of Gefitinib and calculating the docking scores with EGFR.
a Histogram with the number of generated compounds on the vertical axis and their docking scores on the horizontal axis. There were approximately 5700 novel compound structures with improved docking scores compared to Osimertinib, and about 1600 structures with improved scores compared to Gefitinib. b Novel generated compounds with top docking scores against EGFR. The numbers below the compound structures represent the docking scores. Among these, the majority of the structures contain a kinase-inhibiting quinazoline moiety, known to play a crucial role in EGFR interactions. In addition, it can be seen that the docking scores have been significantly improved due to the addition of other structural components. c Histogram of the docking scores for the virtual compounds generated by the machine-learning-based molecular generation tool, REINVENT (version 3.0).
Fig. 6. Docking poses between EGFR and Gefitinib, as well as EGFR and the novel generated compounds.
a Docking pose of the interaction between EGFR and Gefitinib. b Docking pose of the interaction between EGFR and the novel compound with the highest docking score. c Docking pose of the interaction between EGFR and the novel compound with the second-highest docking score. Carbon atoms in the EGFR structure within 4 Å of the ligand compound are shown in light blue, and the parts where interactions were confirmed in the simulation results are indicated by yellow dashed lines. Gefitinib interacts with methionine at position 793, whereas the ligand with the highest docking score interacts with methionine at position 793 as well as arginine at position 841 and asparagine at position 842. The ligand with the second-highest docking score interacts with methionine at position 790, cysteine at position 797, and alanine at position 743.
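Selecting protein atoms within a distance cutoff of the ligand, as done for the light-blue highlighting, can be sketched as below. The atom labels and coordinates are made up and are not taken from any EGFR structure; the function name `atoms_within` is hypothetical, and a real workflow would read coordinates from a structure file.

```python
import math
from typing import List, Tuple

Atom = Tuple[str, Tuple[float, float, float]]  # (label, xyz coordinates in angstroms)


def atoms_within(protein: List[Atom], ligand: List[Atom], cutoff: float = 4.0) -> List[str]:
    """Return labels of protein atoms lying within `cutoff` angstroms of any ligand atom."""
    near = []
    for label, p_xyz in protein:
        if any(math.dist(p_xyz, l_xyz) <= cutoff for _, l_xyz in ligand):
            near.append(label)
    return near


# Made-up coordinates for illustration only.
protein_atoms = [
    ("P1:C", (1.0, 0.0, 0.0)),   # 1.0 angstrom from the first ligand atom
    ("P2:C", (3.0, 3.0, 0.0)),   # about 2.1 angstroms from the second ligand atom
    ("P3:C", (10.0, 0.0, 0.0)),  # far from both ligand atoms
]
ligand_atoms = [("L:C1", (0.0, 0.0, 0.0)), ("L:N2", (1.5, 1.5, 0.0))]
close = atoms_within(protein_atoms, ligand_atoms)
```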
Fig. 7. Overall structure of NP-VAE.
When the compound structure information is input to the Encoder, the latent variable z is calculated based on the tree structure obtained by preprocessing. In the Decoder, the compound structure is calculated and output using a continuous algorithm with z as the input. During training, a pathway is used in parallel to predict the functional indices of the compound with the latent variable as input. This allows for the construction of a chemical latent space that takes into account not only structural information but also functional information.
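The training setup described above, with a latent variable z and a parallel pathway that predicts functional indices, can be sketched with the standard variational autoencoder building blocks. This is a generic sketch, not the NP-VAE implementation: the reparameterization step and the Gaussian KL term follow the usual VAE formulation, while the additive loss form, the weights `kl_weight` and `property_weight`, and all function names are assumptions.

```python
import math
import random
from typing import List


def sample_latent(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    """Reparameterization trick: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """KL divergence from a diagonal Gaussian N(mu, sigma^2) to N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def joint_loss(recon_loss: float, mu: List[float], log_var: List[float],
               predicted_property: float, true_property: float,
               kl_weight: float = 1.0, property_weight: float = 1.0) -> float:
    """Reconstruction + KL + squared error of the predicted functional index (assumed form)."""
    kl = kl_to_standard_normal(mu, log_var)
    property_error = (predicted_property - true_property) ** 2
    return recon_loss + kl_weight * kl + property_weight * property_error


rng = random.Random(0)
mu, log_var = [0.0, 1.0], [0.0, 0.0]  # toy encoder outputs (made up)
z = sample_latent(mu, log_var, rng)
loss = joint_loss(recon_loss=2.0, mu=mu, log_var=log_var,
                  predicted_property=0.5, true_property=0.5)
```

Adding the property term to the loss is one common way to make nearby latent points share functional values as well as structure, which matches the caption's description of a latent space that reflects both.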
