Front Artif Intell. 2019 Sep 6;2:17.
doi: 10.3389/frai.2019.00017. eCollection 2019.

Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks

Suman K Chakravarti et al.

Abstract

The current practice of building QSAR models usually involves computing a set of descriptors for the training set compounds, applying a descriptor selection algorithm, and finally using a statistical fitting method to build the model. In this study, we explored the prospects of building good-quality, interpretable QSARs for large and diverse datasets without using any pre-calculated descriptors. We used different forms of Long Short-Term Memory (LSTM) neural networks to achieve this, trained directly on either traditional SMILES codes or a new linear molecular notation developed as part of this work. Three endpoints were modeled: Ames mutagenicity, inhibition of P. falciparum Dd2, and inhibition of Hepatitis C Virus, with training sets ranging from 7,866 to 31,919 compounds. To boost the interpretability of the prediction results, an attention mechanism, used jointly with a bidirectional LSTM, was applied to detect structural alerts for the mutagenicity data set. Traditional fragment descriptor-based models were used for comparison. According to the results of the external and cross-validation experiments, the overall prediction accuracies of the LSTM models were close to those of the fragment-based models. However, the LSTM models were superior in predicting test chemicals that are dissimilar to the training set compounds, a coveted quality of QSAR models in real-world applications. In summary, it is possible to build QSAR models with LSTMs without pre-computed traditional descriptors, and the models are far from being "black boxes." We hope that this study will help bring large, descriptor-less QSARs into mainstream use.

Keywords: LSTM (long short term memory networks); QSAR (quantitative structure-activity relationships); RNN (recurrent neural network); big data; hepatitis (C) virus; machine learning; malaria; mutagenicity.
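The core modeling idea is to feed the molecular string itself (a SMILES code or the new linear notation) into a recurrent network, with no descriptor calculation step. The sketch below illustrates that idea with a character-level LSTM classifier in Keras; it is not the authors' code, and the vocabulary handling, layer sizes, and training settings are illustrative assumptions.

    # Minimal sketch (not the authors' code): a character-level LSTM classifier
    # trained directly on SMILES strings, with no precomputed descriptors.
    import numpy as np
    import tensorflow as tf

    def build_vocab(smiles_list):
        # Map every character seen in the training SMILES to an integer id; 0 is padding.
        chars = sorted({c for s in smiles_list for c in s})
        return {c: i + 1 for i, c in enumerate(chars)}

    def encode(smiles_list, vocab, max_len):
        # Turn SMILES strings into fixed-length integer sequences, padded with zeros.
        x = np.zeros((len(smiles_list), max_len), dtype="int32")
        for i, s in enumerate(smiles_list):
            for j, c in enumerate(s[:max_len]):
                x[i, j] = vocab.get(c, 0)
        return x

    def build_model(vocab_size, embed_dim=64, lstm_units=128):
        # Embedding -> unidirectional LSTM -> sigmoid probability of activity.
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size + 1, embed_dim, mask_zero=True),
            tf.keras.layers.LSTM(lstm_units),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    # Toy usage; the paper's training sets range from 7,866 to 31,919 compounds.
    train_smiles = ["CCO", "c1ccccc1N", "CC(=O)Nc1ccc(O)cc1", "O=[N+]([O-])c1ccccc1"]
    train_labels = np.array([0, 1, 0, 1])
    vocab = build_vocab(train_smiles)
    max_len = max(len(s) for s in train_smiles)
    x = encode(train_smiles, vocab, max_len)
    model = build_model(len(vocab))
    model.fit(x, train_labels, epochs=2, batch_size=2, verbose=0)
    prob = model.predict(encode(["CCN"], vocab, max_len), verbose=0)[0, 0]
    is_active = prob >= 0.5   # decision threshold turns the probability into active/inactive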


Figures

Figure 1. Processing a SMILES code using a unidirectional LSTM network. Activations from the LSTM unit at a particular step are denoted by ax; a0 is the initial activation and is an array of zeroes. The output is the predicted probability, which can be converted to an active/inactive call using a decision threshold.
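The step-by-step processing in Figure 1 can be mimicked by applying an LSTM cell one character at a time, starting from a zero state a0 and ending in a sigmoid output. The snippet below is an illustrative sketch only; the layer sizes and token ids are assumptions.

    import tensorflow as tf

    units, embed_dim, vocab_size = 128, 64, 40         # assumed sizes
    embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
    lstm_cell = tf.keras.layers.LSTMCell(units)
    out_layer = tf.keras.layers.Dense(1, activation="sigmoid")

    token_ids = tf.constant([[5, 12, 7, 7, 3]])        # one encoded SMILES string (batch of 1)
    steps = embedding(token_ids)                       # shape (1, sequence_length, embed_dim)

    # a0: the initial hidden and cell states are arrays of zeroes
    state = [tf.zeros((1, units)), tf.zeros((1, units))]
    for t in range(steps.shape[1]):
        a_t, state = lstm_cell(steps[:, t, :], state)  # a_t is the activation after step t

    probability = out_layer(a_t)                       # predicted probability of activity
    label = "active" if float(probability) >= 0.5 else "inactive"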
Figure 2. Identification of the mutagenicity structural alert for a query compound using the attention values on the MLNCT code processed through an attention-based bidirectional LSTM model.
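The attention mechanism of Figure 2 can be sketched as a bidirectional LSTM whose per-character activations are scored and softmax-normalized into attention weights; reading those weights back for a query compound highlights the characters of the input notation (SMILES or MLNCT), and hence the substructure, behind a positive mutagenicity call. The architecture below is an illustrative assumption, not the authors' exact network.

    import tensorflow as tf

    vocab_size, embed_dim, lstm_units, max_len = 40, 64, 64, 100   # assumed sizes

    tokens = tf.keras.Input(shape=(max_len,), dtype="int32")
    emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)
    h = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(emb)  # (batch, len, 2*units)

    scores = tf.keras.layers.Dense(1)(h)                 # one raw score per character position
    weights = tf.keras.layers.Softmax(axis=1)(scores)    # attention weights over the sequence
    context = tf.keras.layers.Dot(axes=1)([weights, h])  # attention-weighted sum of activations
    context = tf.keras.layers.Flatten()(context)
    prob = tf.keras.layers.Dense(1, activation="sigmoid")(context)

    classifier = tf.keras.Model(tokens, prob)            # trained on mutagenicity labels
    attention_view = tf.keras.Model(tokens, weights)     # reused afterwards to inspect weights

    # For a query compound, the positions carrying the largest attention weights
    # point at the substructure interpreted as the structural alert.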
Figure 3. ROC plots for the Ames mutagenicity external test set predictions.
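ROC curves such as those in Figure 3 (and Figure 6 below) can be generated from the predicted probabilities with scikit-learn; the labels and probabilities in this sketch are toy placeholders.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 1, 1, 0, 1])             # external test set labels (toy values)
    y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9])   # model-predicted probabilities
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    print("AUC:", roc_auc_score(y_true, y_prob))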
Figure 4. Performance of the mutagenicity models for groups within the 1,942 external set compounds with varying similarity to the 17,005 training set chemicals. Each step on the horizontal axis comprises 50 test compounds. The confidence interval bands around the lines were obtained using a bootstrap resampling process.
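The bootstrap confidence bands in Figures 4, 7, and 8 can be approximated by resampling each similarity bin with replacement and recomputing the accuracy. The exact procedure is not spelled out here, so the function below is an assumption about one common way to do it.

    import numpy as np

    def bootstrap_accuracy_band(y_true, y_pred, n_boot=1000, alpha=0.05, rng=None):
        # Return (mean accuracy, lower bound, upper bound) for one bin of predictions.
        rng = rng or np.random.default_rng(0)
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        accs = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), size=len(y_true))   # resample with replacement
            accs.append(np.mean(y_true[idx] == y_pred[idx]))
        lo, hi = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return float(np.mean(y_true == y_pred)), float(lo), float(hi)

    # Example: one bin of 50 test compounds with ~80% correct toy predictions
    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=50)
    y_pred = np.where(rng.random(50) < 0.8, y_true, 1 - y_true)
    print(bootstrap_accuracy_band(y_true, y_pred))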
Figure 5. Predicted probability distribution plots for the active and inactive compounds in the Ames mutagenicity external test set.
Figure 6. ROC plots for the Hepatitis C Virus external test set predictions.
Figure 7. Performance of the Hepatitis C Virus models for groups within the 3,547 external set compounds with varying similarity to the 31,919 training set chemicals. Each step on the horizontal axis comprises 100 test compounds. The confidence interval bands around the lines were obtained using a bootstrap resampling process.
Figure 8. Performance of the P. falciparum models for groups within the 1,966 external set compounds with varying similarity to the 7,866 training set chemicals. Each step on the horizontal axis comprises 50 test compounds. The confidence interval bands around the lines were obtained using a bootstrap resampling process.
Figure 9. Comparison of prediction performance of LSTM models built using canonical and randomized SMILES. The Hepatitis C test set was used.
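Randomized (non-canonical) SMILES of the kind compared in Figure 9 can be generated with RDKit's doRandom option; this sketch is an assumption about how such strings may be produced, not necessarily the authors' exact procedure.

    from rdkit import Chem

    def randomized_smiles(smiles, n=5):
        # Return n random, valid SMILES spellings of the same molecule.
        mol = Chem.MolFromSmiles(smiles)
        return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)]

    print(Chem.CanonSmiles("c1ccccc1C(=O)O"))    # one canonical form
    print(randomized_smiles("c1ccccc1C(=O)O"))   # several equivalent random forms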
