2025 Aug 14. doi: 10.1039/d5dd00028a. Online ahead of print.

Going beyond SMILES enumeration for data augmentation in generative drug discovery


Helena Brinkmann et al. Digit Discov.

Abstract

Data augmentation can alleviate the limitations of small molecular datasets for generative deep learning by 'artificially inflating' the number of instances available for training. SMILES enumeration - wherein multiple valid SMILES strings are used to represent the same molecule - has proven particularly beneficial for improving the quality of de novo molecule design. Herein, we investigated whether rethinking SMILES augmentation techniques could further enhance the quality of de novo design. To this end, we introduce four novel approaches for SMILES augmentation, drawing inspiration from natural language processing and from chemical insights: (a) token deletion, (b) atom masking, (c) bioisosteric substitution, and (d) self-training. Our systematic analysis shows the promise of these additional strategies for SMILES augmentation. Each strategy offers distinct advantages; for example, atom masking is particularly promising for learning desirable physico-chemical properties in very low-data regimes, and token deletion for creating novel scaffolds. This new repertoire of SMILES augmentation strategies expands the available toolkit for designing molecules with bespoke properties in low-data scenarios.
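To make the baseline concrete, the sketch below shows how randomized SMILES can be generated with RDKit's SMILES writer. This is a minimal illustration, not the authors' implementation; the example molecule and the number of variants are assumptions.

```python
# Minimal sketch of SMILES enumeration, assuming RDKit is available.
from rdkit import Chem

def enumerate_smiles(smiles: str, n_variants: int = 10, max_tries: int = 200) -> list[str]:
    """Return up to n_variants distinct SMILES strings for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants: set[str] = set()
    tries = 0
    while len(variants) < n_variants and tries < max_tries:
        # doRandom=True starts the graph traversal at a random atom and
        # walks it in a random order, yielding a different valid SMILES.
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        tries += 1
    return sorted(variants)

print(enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, illustrative
```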


Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1
Fig. 1. Overview of SMILES augmentation methods. (a) SMILES enumeration (used as the baseline in this work), where multiple SMILES strings are obtained by starting the graph traversal from different non-hydrogen atoms and/or by proceeding in different directions. (b) Token deletion, where new SMILES strings are generated by randomly removing tokens from the original string. (c) Atom masking, where atoms are randomly replaced with dummy tokens (‘[*]’). (d) Bioisosteric substitution, where pre-defined functional groups are substituted with their reported bioisosteres. (e) Self-training, where novel SMILES are generated by a trained CLM and used in turn to augment the initial set for the next training phase.
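As a string-level illustration of (b) and (c), the sketch below applies token deletion and atom masking with a simple regex SMILES tokenizer. The tokenizer, the 10% rates, and the function names are assumptions of this sketch, not the paper's exact implementation; note that deletion can produce syntactically invalid strings, which is the property Fig. 2 quantifies.

```python
# Illustrative sketch of token deletion and atom masking on SMILES strings.
import random
import re

# A simple SMILES tokenizer: multi-character tokens must be matched first.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFIbcnops]|.")
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]")

def token_deletion(smiles: str, rate: float = 0.1) -> str:
    """Randomly drop tokens; the result may be an invalid SMILES string."""
    return "".join(t for t in TOKEN_RE.findall(smiles) if random.random() >= rate)

def atom_masking(smiles: str, rate: float = 0.1) -> str:
    """Randomly replace atom tokens with the dummy token '[*]'."""
    return "".join(
        "[*]" if ATOM_RE.fullmatch(t) and random.random() < rate else t
        for t in TOKEN_RE.findall(smiles)
    )

print(token_deletion("CC(=O)Oc1ccccc1C(=O)O"))
print(atom_masking("CC(=O)Oc1ccccc1C(=O)O"))
```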
Fig. 2
Fig. 2. Syntactic validity of SMILES across augmentation strategies and augmentation folds. Two levels of augmentation (three- and ten-fold) were analyzed across five training set sizes (1000, 2500, 5000, 7500, and 10 000 SMILES). For each set-up, 1000 SMILES strings were generated across four repetitions for the analysis. The highest validity obtained by SMILES enumeration and without any augmentation is represented as a solid and a dashed line, respectively. Statistically significant differences (one-sided Wilcoxon rank-sum test, p < 0.05) between the new augmentation approaches and SMILES enumeration (10×) are marked with asterisks.
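The validity metric itself is straightforward to reproduce. Below is a sketch assuming RDKit and SciPy; the repetition-level values and the direction of the one-sided test are illustrative assumptions.

```python
# Sketch of the syntactic-validity metric and significance test in Fig. 2.
from rdkit import Chem, RDLogger
from scipy.stats import ranksums

RDLogger.DisableLog("rdApp.error")  # silence parse errors for invalid strings

def validity(generated: list[str]) -> float:
    """Fraction of generated strings that parse into an RDKit molecule."""
    return sum(Chem.MolFromSmiles(s) is not None for s in generated) / len(generated)

# One-sided Wilcoxon rank-sum test across repetitions; values illustrative.
validity_new = [0.91, 0.93, 0.90, 0.92]    # a new augmentation, 4 repetitions
validity_enum = [0.88, 0.90, 0.89, 0.87]   # SMILES enumeration (10x)
stat, p = ranksums(validity_new, validity_enum, alternative="greater")
print(f"p = {p:.4f}")
```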
Fig. 3
Fig. 3. Distribution learning after fine-tuning. The Kolmogorov–Smirnov (KS) distance for eight selected descriptors was calculated between 3000 designs and the respective fine-tuning sets (the lower the KS, the better). (a) KS distances grouped by fine-tuning set similarity (high/low) and number of fine-tuning molecules (10, 100). Statistically significant differences (Wilcoxon signed-rank test, p < 0.05) between the new augmentation approaches and no augmentation or SMILES enumeration are marked with asterisks. (b–e) Principal component analysis (PCA) of the KS values for different dataset sizes (b and d: 10; c and e: 100) and similarity levels (b and c: high; d and e: low). ‘Best’ and ‘Worst’ indicate the lowest and highest KS values obtained across experiments, and the connecting line represents the direction of average performance variation from best to worst.
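For reference, the per-descriptor KS distance can be sketched as follows, assuming RDKit and SciPy; the descriptor shown (molecular weight) stands in for the paper's eight descriptors, and the function name is illustrative.

```python
# Sketch of the per-descriptor KS distance underlying Fig. 3.
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy.stats import ks_2samp

def descriptor_ks(designs: list[str], reference: list[str], descriptor) -> float:
    """KS distance between a descriptor's distributions in two SMILES sets."""
    def values(smiles_list):
        mols = (Chem.MolFromSmiles(s) for s in smiles_list)
        return [descriptor(m) for m in mols if m is not None]
    return ks_2samp(values(designs), values(reference)).statistic

# e.g. descriptor_ks(designs, fine_tuning_set, Descriptors.MolWt)
```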
Fig. 4
Fig. 4. Percentage of the most common scaffolds after training with each method for PPARδ. The most common scaffolds of the PPARδ fine-tuning sets were determined and, for each method, the percentage of the 4000 designs matching each scaffold was calculated for different dataset sizes (a and c: 10; b and d: 100) and similarity levels (a and b: high; c and d: low). The most common scaffold is visualized above each graph with the percentage of its occurrence in the fine-tuning set. The analyses for PIM and JAK2 can be found in Supporting Fig. S6 and S7, respectively.
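A sketch of the scaffold-match percentage, assuming RDKit's Bemis-Murcko scaffolds; the function name and the comparison via canonical scaffold SMILES are assumptions of this sketch.

```python
# Sketch of the scaffold-match percentage reported in Fig. 4.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_match_pct(designs: list[str], scaffold_smiles: str) -> float:
    """Percentage of parsable designs whose Murcko scaffold equals the reference."""
    ref = Chem.CanonSmiles(scaffold_smiles)
    hits = parsed = 0
    for s in designs:
        mol = Chem.MolFromSmiles(s)
        if mol is None:
            continue
        parsed += 1
        if MurckoScaffold.MurckoScaffoldSmiles(mol=mol) == ref:
            hits += 1
    return 100.0 * hits / parsed if parsed else 0.0
```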
