Front Artif Intell. 2019 Sep 6;2:17.
doi: 10.3389/frai.2019.00017. eCollection 2019.

Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks

Suman K Chakravarti et al.

Abstract

The current practice of building QSAR models usually involves computing a set of descriptors for the training set compounds, applying a descriptor selection algorithm, and finally using a statistical fitting method to build the model. In this study, we explored the prospects of building good-quality, interpretable QSARs for large and diverse datasets without using any pre-calculated descriptors. We used different forms of Long Short-Term Memory (LSTM) neural networks to achieve this, trained directly on either traditional SMILES codes or a new linear molecular notation developed as part of this work. Three endpoints were modeled: Ames mutagenicity, inhibition of P. falciparum Dd2, and inhibition of Hepatitis C Virus, with training sets ranging from 7,866 to 31,919 compounds. To boost the interpretability of the prediction results, an attention mechanism, used jointly with a bidirectional LSTM, was applied to detect structural alerts for the mutagenicity data set. Traditional fragment descriptor-based models were used for comparison. According to the results of the external and cross-validation experiments, the overall prediction accuracies of the LSTM models were close to those of the fragment-based models. However, the LSTM models were superior in predicting test chemicals that are dissimilar to the training set compounds, a coveted quality of QSAR models in real-world applications. In summary, it is possible to build QSAR models with LSTMs without pre-computed traditional descriptors, and the models are far from being "black boxes." We hope that this study will help bring large, descriptor-less QSARs into mainstream use.

Keywords: LSTM (long short term memory networks); QSAR (quantitative structure-activity relationships); RNN (recurrent neural network); big data; hepatitis (C) virus; machine learning; malaria; mutagenicity.
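The core modeling idea is to feed the molecular string itself (a SMILES code or the new linear notation) into a recurrent network, with no descriptor calculation step. The sketch below illustrates that idea with a character-level LSTM classifier in Keras; it is not the authors' code, and the vocabulary handling, layer sizes, and training settings are illustrative assumptions.

    # Minimal sketch (not the authors' code): a character-level LSTM classifier
    # trained directly on SMILES strings, with no precomputed descriptors.
    import numpy as np
    import tensorflow as tf

    def build_vocab(smiles_list):
        # Map every character seen in the training SMILES to an integer id; 0 is padding.
        chars = sorted({c for s in smiles_list for c in s})
        return {c: i + 1 for i, c in enumerate(chars)}

    def encode(smiles_list, vocab, max_len):
        # Turn SMILES strings into fixed-length integer sequences, padded with zeros.
        x = np.zeros((len(smiles_list), max_len), dtype="int32")
        for i, s in enumerate(smiles_list):
            for j, c in enumerate(s[:max_len]):
                x[i, j] = vocab.get(c, 0)
        return x

    def build_model(vocab_size, embed_dim=64, lstm_units=128):
        # Embedding -> unidirectional LSTM -> sigmoid probability of activity.
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size + 1, embed_dim, mask_zero=True),
            tf.keras.layers.LSTM(lstm_units),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    # Toy usage; the paper's training sets range from 7,866 to 31,919 compounds.
    train_smiles = ["CCO", "c1ccccc1N", "CC(=O)Nc1ccc(O)cc1", "O=[N+]([O-])c1ccccc1"]
    train_labels = np.array([0, 1, 0, 1])
    vocab = build_vocab(train_smiles)
    max_len = max(len(s) for s in train_smiles)
    x = encode(train_smiles, vocab, max_len)
    model = build_model(len(vocab))
    model.fit(x, train_labels, epochs=2, batch_size=2, verbose=0)
    prob = model.predict(encode(["CCN"], vocab, max_len), verbose=0)[0, 0]
    is_active = prob >= 0.5   # decision threshold turns the probability into active/inactive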


Figures

Figure 1. Processing a SMILES code using a unidirectional LSTM network. Activations from the LSTM unit at a particular step are denoted by ax; a0 is the initial activation and is an array of zeroes. The output is the predicted probability, which can be converted to an active/inactive call using a decision threshold.
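The step-by-step processing in Figure 1 can be mimicked by applying an LSTM cell one character at a time, starting from a zero state a0 and ending in a sigmoid output. The snippet below is an illustrative sketch only; the layer sizes and token ids are assumptions.

    import tensorflow as tf

    units, embed_dim, vocab_size = 128, 64, 40         # assumed sizes
    embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
    lstm_cell = tf.keras.layers.LSTMCell(units)
    out_layer = tf.keras.layers.Dense(1, activation="sigmoid")

    token_ids = tf.constant([[5, 12, 7, 7, 3]])        # one encoded SMILES string (batch of 1)
    steps = embedding(token_ids)                       # shape (1, sequence_length, embed_dim)

    # a0: the initial hidden and cell states are arrays of zeroes
    state = [tf.zeros((1, units)), tf.zeros((1, units))]
    for t in range(steps.shape[1]):
        a_t, state = lstm_cell(steps[:, t, :], state)  # a_t is the activation after step t

    probability = out_layer(a_t)                       # predicted probability of activity
    label = "active" if float(probability) >= 0.5 else "inactive"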
Figure 2. Identification of the mutagenicity structural alert for a query compound using the attention values on the MLNCT code processed through an attention-based bidirectional LSTM model.
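The attention mechanism of Figure 2 can be sketched as a bidirectional LSTM whose per-character activations are scored and softmax-normalized into attention weights; reading those weights back for a query compound highlights the characters of the input notation (SMILES or MLNCT), and hence the substructure, behind a positive mutagenicity call. The architecture below is an illustrative assumption, not the authors' exact network.

    import tensorflow as tf

    vocab_size, embed_dim, lstm_units, max_len = 40, 64, 64, 100   # assumed sizes

    tokens = tf.keras.Input(shape=(max_len,), dtype="int32")
    emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)
    h = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(emb)  # (batch, len, 2*units)

    scores = tf.keras.layers.Dense(1)(h)                 # one raw score per character position
    weights = tf.keras.layers.Softmax(axis=1)(scores)    # attention weights over the sequence
    context = tf.keras.layers.Dot(axes=1)([weights, h])  # attention-weighted sum of activations
    context = tf.keras.layers.Flatten()(context)
    prob = tf.keras.layers.Dense(1, activation="sigmoid")(context)

    classifier = tf.keras.Model(tokens, prob)            # trained on mutagenicity labels
    attention_view = tf.keras.Model(tokens, weights)     # reused afterwards to inspect weights

    # For a query compound, the positions carrying the largest attention weights
    # point at the substructure interpreted as the structural alert.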
Figure 3. ROC plots for the Ames mutagenicity external test set predictions.
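ROC curves such as those in Figure 3 (and Figure 6 below) can be generated from the predicted probabilities with scikit-learn; the labels and probabilities in this sketch are toy placeholders.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 1, 1, 0, 1])             # external test set labels (toy values)
    y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9])   # model-predicted probabilities
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    print("AUC:", roc_auc_score(y_true, y_prob))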
Figure 4. Performance of the mutagenicity models for groups within the 1,942 external set compounds with varying similarity to the 17,005 training set chemicals. Each step on the horizontal axis comprises 50 test compounds. The confidence interval bands around the lines were obtained using a bootstrap resampling process.
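The bootstrap confidence bands in Figures 4, 7, and 8 can be approximated by resampling each similarity bin with replacement and recomputing the accuracy. The exact procedure is not spelled out here, so the function below is an assumption about one common way to do it.

    import numpy as np

    def bootstrap_accuracy_band(y_true, y_pred, n_boot=1000, alpha=0.05, rng=None):
        # Return (mean accuracy, lower bound, upper bound) for one bin of predictions.
        rng = rng or np.random.default_rng(0)
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        accs = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), size=len(y_true))   # resample with replacement
            accs.append(np.mean(y_true[idx] == y_pred[idx]))
        lo, hi = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return float(np.mean(y_true == y_pred)), float(lo), float(hi)

    # Example: one bin of 50 test compounds with ~80% correct toy predictions
    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=50)
    y_pred = np.where(rng.random(50) < 0.8, y_true, 1 - y_true)
    print(bootstrap_accuracy_band(y_true, y_pred))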
Figure 5. Predicted probability distribution plots for the active and inactive compounds in the Ames mutagenicity external test set.
Figure 6. ROC plots for the Hepatitis C Virus external test set predictions.
Figure 7. Performance of the Hepatitis C Virus models for groups within the 3,547 external set compounds with varying similarity to the 31,919 training set chemicals. Each step on the horizontal axis comprises 100 test compounds. The confidence interval bands around the lines were obtained using a bootstrap resampling process.
Figure 8. Performance of the P. falciparum models for groups within the 1,966 external set compounds with varying similarity to the 7,866 training set chemicals. Each step on the horizontal axis comprises 50 test compounds. The confidence interval bands around the lines were obtained using a bootstrap resampling process.
Figure 9. Comparison of prediction performance of LSTM models built using canonical and randomized SMILES. The Hepatitis C test set was used.
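Randomized (non-canonical) SMILES of the kind compared in Figure 9 can be generated with RDKit's doRandom option; this sketch is an assumption about how such strings may be produced, not necessarily the authors' exact procedure.

    from rdkit import Chem

    def randomized_smiles(smiles, n=5):
        # Return n random, valid SMILES spellings of the same molecule.
        mol = Chem.MolFromSmiles(smiles)
        return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)]

    print(Chem.CanonSmiles("c1ccccc1C(=O)O"))    # one canonical form
    print(randomized_smiles("c1ccccc1C(=O)O"))   # several equivalent random forms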
