Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation
- PMID: 40374848
- PMCID: PMC12081657
- DOI: 10.1038/s41598-025-01890-7
Abstract
The Simplified Molecular Input Line Entry System (SMILES) is one of the most widely adopted molecular representations. However, SMILES notation suffers from limited token diversity and a lack of chemical information within individual tokens. To address these limitations while maintaining its simplicity, we propose a molecular representation that hybridizes standard SMILES tokens with Atom-In-SMILES (AIS) tokens, each of which encodes local chemical environment information in a single token. This hybrid representation, termed SMI + AIS, lets AIS tokens differentiate chemical elements by their chemical context without introducing additional tokens for less frequent elements. We evaluated SMI + AIS on chemical structure generation based on latent space optimization, comparing predefined metrics of the generated structures. Compared to standard SMILES, SMI + AIS achieved a 7% improvement in binding affinity and a 6% increase in synthesizability, highlighting its utility for machine learning-based molecular design. Our results demonstrate that the SMI + AIS representation provides a more effective and informative way to encapsulate chemical context and has potential to enhance performance in other machine learning tasks in chemistry.
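The hybridization idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the environment-token strings, the `atom_envs` input, and the `ais_vocab` set are hypothetical placeholders, and in practice the per-atom environments would be derived with a cheminformatics toolkit. The sketch only shows the selection rule: an atom token is replaced by its environment-aware token when that token is in the chosen vocabulary, and is otherwise left as a plain SMILES token.

```python
import re

# Commonly used regex for splitting a SMILES string into tokens: bracket atoms,
# two-letter halogens, organic-subset atoms, ring closures, bonds and branches.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)
# Tokens that stand for an atom (as opposed to bonds, branches, ring closures).
ATOM_TOKEN_RE = re.compile(r"^(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp])$")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def hybridize(tokens, atom_envs, ais_vocab):
    """Swap atom tokens for environment-aware tokens found in ais_vocab.

    atom_envs (assumed input) holds one environment-aware token per atom, in
    the order the atoms appear in the SMILES string. ais_vocab (assumed input)
    is the set of environment-aware tokens selected for the hybrid vocabulary.
    """
    n_atoms = sum(1 for tok in tokens if ATOM_TOKEN_RE.match(tok))
    if n_atoms != len(atom_envs):
        raise ValueError("need exactly one environment token per atom")
    hybrid = []
    atom_index = 0
    for tok in tokens:
        if ATOM_TOKEN_RE.match(tok):
            env_tok = atom_envs[atom_index]
            atom_index += 1
            # Keep the plain SMILES token when the environment token was not selected.
            hybrid.append(env_tok if env_tok in ais_vocab else tok)
        else:
            hybrid.append(tok)
    return hybrid


# Ethanol, with illustrative (hypothetical) environment tokens for its three atoms.
ethanol_tokens = tokenize_smiles("CCO")
ethanol_envs = ["[CH3;!R;C]", "[CH2;!R;CO]", "[OH;!R;C]"]
selected = {"[CH3;!R;C]", "[CH2;!R;CO]"}
print(hybridize(ethanol_tokens, ethanol_envs, selected))
```

Here the two carbon atoms receive environment-aware tokens while the oxygen stays a plain `O`, because its environment token is not in the selected set. How that set is chosen (for example, by token frequency in a training corpus) is left open in this sketch.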
Keywords: Drug discovery; Generative modeling; Latent space optimization; Molecular representation; SMILES; Small-molecule drug.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Competing interests: The authors declare no competing interests.