Going beyond SMILES enumeration for data augmentation in generative drug discovery
- PMID: 40917333
- PMCID: PMC12409607
- DOI: 10.1039/d5dd00028a
Abstract
Data augmentation can alleviate the limitations of small molecular datasets for generative deep learning by 'artificially inflating' the number of instances available for training. SMILES enumeration, wherein multiple valid SMILES strings are used to represent the same molecule, has proven particularly beneficial for improving the quality of de novo molecule design. Herein, we investigated whether rethinking SMILES augmentation techniques could further enhance the quality of de novo design. To this end, we introduce four novel approaches for SMILES augmentation, drawing inspiration from natural language processing and chemistry insights: (a) token deletion, (b) atom masking, (c) bioisosteric substitution, and (d) self-training. Our systematic analysis showed the promise of considering additional strategies for SMILES augmentation. Every strategy showed distinct advantages; for example, atom masking is particularly promising for learning desirable physico-chemical properties in very low-data regimes, and token deletion for creating novel scaffolds. This new repertoire of SMILES augmentation strategies expands the available toolkit to design molecules with bespoke properties in low-data scenarios.
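To make two of the strategies named above concrete, the following is a minimal sketch of token deletion and atom masking applied to a SMILES string. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the tokenizer, the deletion or masking probabilities, or the mask symbol. The regular expression, the function names (`tokenize`, `delete_tokens`, `mask_atoms`), the probability parameter `p`, and the use of `*` as the mask token are all hypothetical choices made here. Only the Python standard library is used.

```python
import random
import re

# Simple regex-based SMILES tokenizer (assumption: not taken from the paper).
# Bracket atoms and two-letter halogens are matched before single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#\-\+\(\)/\\\.@~\*\$:])"
)

# Tokens treated as atoms for masking (bracket atoms are handled separately).
ATOM_TOKENS = {"B", "C", "N", "O", "S", "P", "F", "I", "Br", "Cl",
               "b", "c", "n", "o", "s", "p"}


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return tokens


def is_atom(token):
    """Return True for bracket atoms and organic-subset atom symbols."""
    return token.startswith("[") or token in ATOM_TOKENS


def delete_tokens(smiles, p=0.05, rng=None):
    """Token deletion: drop each token independently with probability p.

    The result is not guaranteed to be a valid SMILES string.
    """
    rng = rng or random.Random()
    return "".join(t for t in tokenize(smiles) if rng.random() >= p)


def mask_atoms(smiles, p=0.05, mask="*", rng=None):
    """Atom masking: replace each atom token with `mask` with probability p.

    Ring-closure digits, bonds, and branches are left untouched.
    """
    rng = rng or random.Random()
    return "".join(
        mask if is_atom(t) and rng.random() < p else t
        for t in tokenize(smiles)
    )


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    rng = random.Random(0)  # fixed seed so the augmentations are reproducible
    print(delete_tokens(aspirin, p=0.15, rng=rng))
    print(mask_atoms(aspirin, p=0.15, rng=rng))
```

Note that random token deletion can yield strings that no longer parse as valid SMILES; whether such strings are filtered or kept for training is a design choice the abstract does not settle. SMILES enumeration and bioisosteric substitution are not sketched here because both require a cheminformatics toolkit.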
This journal is © The Royal Society of Chemistry.
Conflict of interest statement
There are no conflicts to declare.