Sci Rep. 2025 May 15;15(1):16892.
doi: 10.1038/s41598-025-01890-7.

Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation

Herim Han et al. Sci Rep. 2025.

Abstract

The Simplified Molecular Input Line Entry System (SMILES) is one of the most widely adopted molecular representations. However, SMILES notation suffers from limited token diversity, and its individual tokens carry little chemical information. To address these limitations while maintaining its simplicity, we propose a molecular representation that hybridizes standard SMILES tokens with Atom-In-SMILES (AIS) tokens, each of which incorporates local chemical environment information into a single token. In this hybrid representation, termed SMI + AIS, AIS tokens differentiate chemical elements based on their chemical context, without introducing additional tokens for less frequent elements. We evaluated SMI + AIS on chemical structure generation based on latent space optimization, comparing the generated structures by a predefined metric. Compared to standard SMILES, SMI + AIS achieved a 7% improvement in binding affinity and a 6% increase in synthesizability, highlighting its utility for machine learning-based molecular design. Our results demonstrate that the SMI + AIS representation is a more effective and informative way to encode chemical context and has potential to improve performance in other machine learning tasks in chemistry.

Keywords: Drug discovery; Generative modeling; Latent space optimization; Molecular representation; SMILES; Small-molecule drug.
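
The hybridization described in the abstract can be illustrated with a minimal sketch. It assumes that the N AIS tokens added to the vocabulary (the N in SMI + AIS(N), see Fig. 2) are the N most frequent ones in a corpus, and that every atom outside that set falls back to its plain SMILES token. The function names, the (SMILES token, AIS token) pair layout, and the AIS strings for benzoic acid are hypothetical: they were written by hand following the (central atom;ring-formation;neighbor atom) pattern of Fig. 1 and are not output of the authors' tooling.

```python
from collections import Counter

# Benzoic acid, O=C(O)c1ccccc1, as (SMILES token, AIS token) pairs.
# Non-atom tokens (bonds, branches, ring closures) have no AIS counterpart.
# The AIS strings are hand-written illustrations, not generated by a real tool.
benzoic_acid = [
    ("O", "[O;!R;C]"), ("=", None), ("C", "[C;!R;COO]"), ("(", None),
    ("O", "[OH;!R;C]"), (")", None), ("c", "[c;R;CCC]"), ("1", None),
    ("c", "[cH;R;CC]"), ("c", "[cH;R;CC]"), ("c", "[cH;R;CC]"),
    ("c", "[cH;R;CC]"), ("c", "[cH;R;CC]"), ("1", None),
]

def build_ais_vocab(corpus, n):
    """Return the n most frequent AIS tokens found in the corpus (assumed selection rule)."""
    counts = Counter(ais for mol in corpus for _, ais in mol if ais is not None)
    return {token for token, _ in counts.most_common(n)}

def hybridize(mol, ais_vocab):
    """Use the AIS token when it is in the vocabulary, otherwise the SMILES token."""
    return [ais if ais in ais_vocab else smi for smi, ais in mol]

vocab = build_ais_vocab([benzoic_acid], n=1)
hybrid = hybridize(benzoic_acid, vocab)
print(" ".join(hybrid))
# O = C ( O ) c 1 [cH;R;CC] [cH;R;CC] [cH;R;CC] [cH;R;CC] [cH;R;CC] 1
```

With an empty AIS vocabulary the sequence reduces to ordinary SMILES tokens, so in this sketch N controls how much chemical context enters the vocabulary while rare environments add no new tokens.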

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Overview of SMILES, AIS, and their hybridization for benzoic acid. (Right) The molecular structure of benzoic acid and (Bottom) its string representations using the SMILES (SMI) and SMI + AIS methods are illustrated. (Left) The individual SMILES and AIS tokens for the oxygen in the carboxyl group are noted. The AIS token includes atomic environment information (central atom;ring-formation;neighbor atom).
Fig. 2
Token frequency distribution of the database using the SMI + AIS(N) representation. The distributions are shown for (a) SMI, (b) SMI + AIS(50), (c) SMI + AIS(100), (d) SMI + AIS(150), and (e) SMI + AIS(200) representations, where the numbers in parentheses indicate the number of added AIS tokens. Red, green, and blue bars indicate non-physical tokens, AIS tokens, and SMI element tokens, respectively.
Fig. 3
Workflow of molecular structure generation. The ‘Calculate Objectives’ step computes the objective value, designed as [formula image], where BA and SA indicate binding affinity and synthetic accessibility score, respectively. The ‘Bayesian Opt’ step generates candidate vectors, and the ‘Syntax Validation Check’ step verifies the grammar of all generated string representations using RDKit. ‘Encoder’ and ‘Decoder’ refer to the components of the conditional variational autoencoder that convert a string to a latent vector and back, respectively. The Bayesian optimization is performed iteratively, incorporating updated information from the generated molecular representations.
Fig. 4
Molecular structure generation results for the PDK4 target with various representation methods. (a) Initial structures and their properties (BA, SA, and MW indicate binding affinity, synthetic accessibility, and molecular weight, respectively). (b) The distributions of objective values of Top-1 structures obtained from 10 independently performed optimizations. The left, middle, and right subplots show the optimized Top-1 results after 1, 3, and 5 iterations. The red line denotes the maximum objective value among the initial compounds. (c) The 10 Top-1 structures and their molecular properties derived from 10 independent optimizations with the SMI + AIS(100) representation. The red and green colors in (a) and (c) indicate acetamide and piperidine, respectively.
Fig. 5
Synthetic accessibility (SA) and binding affinity (BA) of generated compounds targeting PDK4. The plots show 2D density maps of molecules generated with (a) SELFIES, (b) SMI, and (c) SMI + AIS(100). Red crosses represent the scores of initial compounds, while black stars indicate the scores of Top-1 optimized structures.
Fig. 6
The distributions of Top-k objective values from the optimizations using four different protein targets. (a) Top-1, (b) Top-10, and (c) Top-100 results from 10 independent molecular generations are displayed. The green- and orange-filled regions denote the distributions of objective values from optimizations with SMI and SMI + AIS(100), respectively.
