Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models

Dejun Jiang^#¹,²,³, Zhenxing Wu^#¹, Chang-Yu Hsieh⁴, Guangyong Chen⁵, Ben Liao⁴, Zhe Wang¹, Chao Shen¹, Dongsheng Cao⁶, Jian Wu⁷, Tingjun Hou⁸,⁹

Affiliations

¹ Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
² State Key Lab of CAD & CG, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
³ College of Computer Science and Technology, Zhejiang University, Hangzhou, China.
⁴ Tencent Quantum Laboratory, Tencent, Shenzhen, 518057, Guangdong, China.
⁵ Shenzhen Institutes of Advanced Technology, Shenzhen, 518055, Guangdong, China.
⁶ Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410004, Hunan, China. oriental-cds@163.com.
⁷ College of Computer Science and Technology, Zhejiang University, Hangzhou, China. wujian2000@zju.edu.cn.
⁸ Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China. tingjunhou@zju.edu.cn.
⁹ State Key Lab of CAD & CG, Zhejiang University, Hangzhou, 310058, Zhejiang, China. tingjunhou@zju.edu.cn.

^# Contributed equally.

PMID: 33597034
PMCID: PMC7888189
DOI: 10.1186/s13321-020-00479-8

Dejun Jiang et al. J Cheminform. 2021 Feb 17;13(1):12. doi: 10.1186/s13321-020-00479-8.

Abstract

Graph neural networks (GNNs) have been considered an attractive modelling method for molecular property prediction, and numerous studies have shown that GNNs can yield more promising results than traditional descriptor-based methods. In this study, based on 11 public datasets covering various property endpoints, the predictive capacity and computational efficiency of the prediction models developed by eight machine learning (ML) algorithms, including four descriptor-based models (SVM, XGBoost, RF and DNN) and four graph-based models (GCN, GAT, MPNN and Attentive FP), were extensively tested and compared. The results demonstrate that, on average, the descriptor-based models outperform the graph-based models in terms of prediction accuracy and computational efficiency. SVM generally achieves the best predictions for the regression tasks. Both RF and XGBoost achieve reliable predictions for the classification tasks, and some of the graph-based models, such as Attentive FP and GCN, yield outstanding performance on some of the larger or multi-task datasets. In terms of computational cost, XGBoost and RF are the two most efficient algorithms and need only a few seconds to train a model, even for a large dataset. Interpreting the descriptor-based models with the SHAP method can effectively explore the established domain knowledge. Finally, we explored the use of these models for virtual screening (VS) against HIV and demonstrated that different ML algorithms offer diverse VS profiles. All in all, we believe that the off-the-shelf descriptor-based models can still be directly employed to accurately predict various chemical endpoints with excellent computability and interpretability.

Keywords: ADME/T prediction; Deep learning; Ensemble learning; Extreme gradient boosting; Graph neural networks.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
The general workflow of GNN in molecular property prediction

**Fig. 2**
Importance of the representative molecular descriptors (the top 20) and the corresponding SHAP values given by XGBoost for the (a) ESOL and (b) BBBP datasets. Each molecule contributes one dot on each descriptor's line, and dots stack up to show density

**Fig. 3**
The distributions of the prediction scores for the 1960 screened molecules predicted by the four descriptor-based models, (a) SVM, (b) XGBoost, (c) RF and (d) DNN, and the four graph-based models, (e) GCN, (f) GAT, (g) MPNN and (h) Attentive FP

**Fig. 4**
The heat map of the Euclidean distances of the prediction scores for different model pairs

**Fig. 5**
The structural features of the potential inhibitors given by the four descriptor-based models: (a) SVM, (b) XGBoost, (c) RF and (d) DNN

**Fig. 6**
The structural features of the potential inhibitors predicted by the four graph-based models: (a) GCN, (b) GAT, (c) MPNN and (d) Attentive FP; (e) the structure of the known HIV inhibitor identified by all eight models
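
The model-pair comparison summarised in Fig. 4 rests on the Euclidean distance between two models' prediction-score vectors. A minimal sketch of that calculation follows, assuming each model yields one score per screened molecule in the same molecule order. The scores, the three-model subset and the helper name `pairwise_euclidean` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from itertools import combinations

# Hypothetical prediction scores (illustrative only, not from the paper):
# one score per screened molecule, same molecule order for every model.
scores = {
    "SVM": [0.0, 0.5, 1.0],
    "RF": [0.0, 0.5, 0.0],
    "GCN": [1.0, 0.5, 1.0],
}


def pairwise_euclidean(score_table):
    """Return {(model_a, model_b): distance} for every unordered model pair."""
    distances = {}
    # Sort the model names so the pair keys are deterministic.
    for a, b in combinations(sorted(score_table), 2):
        distances[(a, b)] = math.dist(score_table[a], score_table[b])
    return distances


distances = pairwise_euclidean(scores)
for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d:.3f}")
```

A small distance means two models assign similar scores across the screened set, while a large one indicates diverging screening profiles. Filling a symmetric matrix from these pairs would give the input for a heat map.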
