Transformer-CNN: Swiss knife for QSAR modeling and interpretation
- PMID: 33431004
- PMCID: PMC7079452
- DOI: 10.1186/s13321-020-00423-w
Abstract
We present SMILES embeddings derived from the internal encoder state of a Transformer [1] model trained to canonicalize SMILES as a Seq2Seq problem. Applying a CharNN [2] architecture to these embeddings yields higher-quality, interpretable QSAR/QSPR models on diverse benchmark datasets covering both regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, so each prediction is based on an internal consensus. Because both the augmentation and the transfer learning are based on embeddings, the method provides good results for small datasets. We discuss the reasons for this effectiveness and outline future directions for the development of the method. The source code and the embeddings needed to train a QSAR model are available at https://github.com/bigchem/transformer-cnn. The repository also includes a standalone program for QSAR prediction that calculates individual atom contributions, thus interpreting the model's result. The OCHEM [3] environment (https://ochem.eu) hosts the online implementation of the proposed method.
Keywords: Augmentation; Character-based models; Cheminformatics; Classification; Convolutional neural networks; Embeddings; QSAR; Regression; SMILES; Transformer model.
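The consensus over augmented SMILES mentioned in the abstract can be illustrated with a minimal sketch. It assumes the augmented SMILES strings for one molecule are already available and that the consensus is a simple arithmetic mean; the abstract does not specify the aggregation rule, so this is an assumption. The names `consensus_prediction` and `toy_predict` are hypothetical and are not taken from the transformer-cnn repository.

```python
from statistics import mean
from typing import Callable, Sequence


def consensus_prediction(
    smiles_variants: Sequence[str],
    predict: Callable[[str], float],
) -> float:
    """Average a model's outputs over several SMILES strings of one molecule.

    Assumption: the consensus is an unweighted mean of per-SMILES predictions.
    """
    if not smiles_variants:
        raise ValueError("at least one SMILES string is required")
    return mean(predict(s) for s in smiles_variants)


def toy_predict(smiles: str) -> float:
    # Hypothetical stand-in for a trained model: it only returns the string
    # length, so different spellings of one molecule give different outputs.
    return float(len(smiles))


# Three valid SMILES spellings of the same molecule (ethanol).
ethanol_variants = ["CCO", "OCC", "C(O)C"]

print(consensus_prediction(ethanol_variants, toy_predict))
```

In practice the variants would come from a cheminformatics toolkit that can enumerate non-canonical SMILES, and `predict` would be the trained model; averaging reduces the dependence of the result on any single spelling of the molecule.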
Conflict of interest statement
The authors declare that they have no actual or potential conflicts of interest.
References
- Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. arXiv:1706.03762
- Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. arXiv e-prints. arXiv:1509.01626
- Mauri A, Consonni V, Pavan M, Todeschini R. Dragon software: an easy approach to molecular descriptor calculations. Match. 2006;56:237–248.
- Baskin I, Varnek A. Chemoinformatics approaches to virtual screening. Cambridge: Royal Society of Chemistry; 2008. Fragment descriptors in SAR/QSAR/QSPR studies, molecular similarity analysis and in virtual screening; pp. 1–43.
