J Cheminform. 2020 Mar 18;12(1):17.
doi: 10.1186/s13321-020-00423-w.

Transformer-CNN: Swiss knife for QSAR modeling and interpretation

Pavel Karpov et al. J Cheminform. 2020.

Abstract

We present SMILES embeddings derived from the internal encoder state of a Transformer [1] model trained to canonicalize SMILES as a Seq2Seq problem. Applying a CharNN [2] architecture to these embeddings yields higher-quality, interpretable QSAR/QSPR models on diverse benchmark datasets covering both regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for both training and inference, so each prediction is based on an internal consensus over the augmented SMILES of a molecule. Because both the augmentation and the transfer learning are based on embeddings, the method gives good results even for small datasets. We discuss the reasons for this effectiveness and outline future directions for developing the method. The source code and the embeddings needed to train a QSAR model are available at https://github.com/bigchem/transformer-cnn. The repository also contains a standalone program for QSAR prediction that calculates individual atom contributions, thereby interpreting the model's result. The OCHEM [3] environment (https://ochem.eu) hosts the online implementation of the proposed method.
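The consensus idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy scoring model, and the use of a plain arithmetic mean as the consensus rule are all assumptions made for illustration, and the SMILES variants are written by hand rather than generated by an augmentation tool.

```python
from statistics import mean
from typing import Callable, Iterable


def consensus_prediction(
    smiles_variants: Iterable[str],
    predict_one: Callable[[str], float],
) -> float:
    """Average a model's predictions over several SMILES of one molecule.

    Assumption: the consensus is a plain arithmetic mean; the abstract
    does not state which aggregation rule the method uses.
    """
    predictions = [predict_one(s) for s in smiles_variants]
    if not predictions:
        raise ValueError("at least one SMILES string is required")
    return mean(predictions)


def toy_model(smiles: str) -> float:
    """Hypothetical stand-in for a trained model.

    It scores a string by its fraction of aromatic-carbon characters, so
    its output depends on how the SMILES happens to be written.
    """
    return smiles.count("c") / len(smiles)


# Three hand-written SMILES strings for the same molecule (toluene).
toluene_variants = ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"]

# Per-string scores differ; the consensus is a single value per molecule.
per_string = [toy_model(s) for s in toluene_variants]
consensus = consensus_prediction(toluene_variants, toy_model)
```

Because the toy model is sensitive to the string form, the per-string scores differ between equivalent SMILES, while the consensus gives one value for the molecule. This is the effect that prediction over augmented SMILES is meant to exploit.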

Keywords: Augmentation; Character-based models; Cheminformatics; Classification; Convolutional neural networks; Embeddings; QSAR; Regression; SMILES; Transformer model.

Conflict of interest statement

The authors declare that they have no actual or potential conflicts of interest.

Figures

Fig. 1
Canonical SMILES of benzylpenicillin at the top, 2D and 3D structures derived from the SMILES with OpenBabel [15] in the middle, and three non-canonical SMILES examples at the bottom. The phenyl ring substructure is written in bold font.
Fig. 2
Scheme of modern QSAR models based on ANN. The encoder part (left) extracts the main features of the input data by means of RNN (top) or convolutional layers (bottom). The feature vector is then fed, like conventional descriptors, to the dense-layer part, which consists of residual and highway connections, normalization layers, and dropouts.
Fig. 3
Example of the data in the training file for the canonicalization model, shown for the small molecule CHEMBL351484. Every line contains a pair of non-canonical (left) and canonical (right) SMILES separated by “ >> ”. One line has identical SMILES on both sides and is highlighted with a red box.
Fig. 4
The architecture of the Transformer-CNN network.
Fig. 5
Learning curves: (1) learning rate schedule (bottom and right axes), and (2) character-based accuracy (bottom and left axes) on the training dataset for the first four epochs.
Fig. 6
Coefficient of determination, r², calculated for the regression sets (higher values are better).
Fig. 7
AUC calculated for the classification sets (higher values are better).
Fig. 8
Visualization of atom contributions for a mutagenic compound. Red marks mutagenic alerts; green marks atoms that act against mutagenicity.
Fig. 9
Visualization of atom contributions to the aqueous solubility of haloperidol. The green bars mark features that increase solubility, whereas the red bars mark features that decrease it.
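The training-file layout described in the Fig. 3 caption (a non-canonical SMILES, the separator “ >> ”, then the canonical SMILES) can be read with a minimal Python sketch like the one below. The function name and the example lines are hypothetical: the lines are hand-written ethanol SMILES, not taken from the actual training file, and the string treated as canonical is only illustrative.

```python
from typing import List, Tuple

SEPARATOR = " >> "  # separator quoted in the Fig. 3 caption


def parse_pairs(lines: List[str]) -> List[Tuple[str, str]]:
    """Split each 'source >> target' line into a (source, target) pair."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, sep, target = line.partition(SEPARATOR)
        if not sep:
            raise ValueError(f"missing separator in line: {line!r}")
        pairs.append((source.strip(), target.strip()))
    return pairs


# Hypothetical example lines for ethanol; "CCO" plays the canonical role.
example_lines = [
    "OCC >> CCO",
    "C(O)C >> CCO",
    "CCO >> CCO",
]

pairs = parse_pairs(example_lines)

# Pairs whose two sides are identical, like the boxed line in Fig. 3.
identical = [pair for pair in pairs if pair[0] == pair[1]]
```

The last example line plays the role of the boxed line in Fig. 3, where the input SMILES is already in canonical form.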

References

    1. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA; 2017. arXiv:1706.03762
    2. Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. arXiv e-prints. 2015. arXiv:1509.01626
    3. Sushko I, Novotarskyi S, Körner R, et al. Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comput Aided Mol Des. 2011;25:533–554.
    4. Mauri A, Consonni V, Pavan M, Todeschini R. Dragon software: an easy approach to molecular descriptor calculations. Match. 2006;56:237–248.
    5. Baskin I, Varnek A. Chemoinformatics approaches to virtual screening. Cambridge: Royal Society of Chemistry; 2008. Fragment descriptors in SAR/QSAR/QSPR studies, molecular similarity analysis and in virtual screening; pp. 1–43.
