Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding
- PMID: 35501680
- PMCID: PMC9063120
- DOI: 10.1186/s12859-022-04681-3
Abstract
Background: Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure-activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A major challenge in developing robust and generalizable deep learning models for QSAR is the lack of large datasets with high-quality, balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), which learns chemical compound representations that carry substructure information from unlabeled data. The representations can be used to predict chemical properties such as binding affinity and toxicity. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds, as well as labeled and partially labeled pharmacological data, to improve the generalizability of neural network models.
Results: We evaluated PLANS-GINFP on predicting Cytochrome P450 (CYP450) binding activity in a CYP450 dataset and chemical toxicity in the Tox21 dataset. Extensive benchmark studies demonstrated that PLANS-GINFP improved performance by a large margin in both cases. Both PLANS-based self-training and GINFP-based self-supervised learning contributed to the improvement.
Conclusion: To better exploit chemical structures as input for machine learning algorithms, we proposed a self-supervised, graph neural network-based embedding method that can encode substructure information. Furthermore, we developed a model-agnostic self-training method, PLANS, that can be applied to any deep learning architecture to improve prediction accuracy. PLANS provides a way to better utilize partially labeled and unlabeled data. Comprehensive benchmark studies demonstrated the potential of both methods for predicting drug metabolism and toxicity profiles from sparse, noisy, and imbalanced data. PLANS-GINFP could serve as a general solution for improving predictive QSAR modeling.
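The abstract does not give implementation details for PLANS, so the following is only a minimal sketch of a generic noisy-student self-training loop with partially labeled targets, not the authors' implementation. It assumes a plain multi-label logistic model in NumPy in place of a deep network, Gaussian input noise as the student's only noise source, and soft pseudo-labels; the names `train_masked`, `predict`, and `noisy_student` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_masked(X, Y, mask, noise_std=0.0, epochs=200, lr=0.5, seed=0):
    """Fit a multi-label logistic model by gradient descent.

    Only entries with mask == 1 contribute to the cross-entropy loss,
    which is how missing labels of partially labeled compounds are skipped.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        # Assumption: Gaussian input noise stands in for the student's noise.
        Xn = X + noise_std * rng.standard_normal(X.shape)
        P = sigmoid(Xn @ W + b)
        # Gradient of masked cross-entropy w.r.t. the logits, per-task mean.
        G = (P - Y) * mask / np.maximum(mask.sum(axis=0), 1.0)
        W -= lr * (Xn.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(model, X):
    W, b = model
    return sigmoid(X @ W + b)

def noisy_student(X_lab, Y_lab, mask_lab, X_unlab, rounds=2, noise_std=0.1):
    """Teacher -> pseudo-labels -> noisy student, repeated for `rounds`."""
    teacher = train_masked(X_lab, Y_lab, mask_lab)  # teacher trains without noise
    for r in range(rounds):
        # Keep observed labels; fill missing entries with teacher predictions.
        Y_fill = np.where(mask_lab == 1, Y_lab, predict(teacher, X_lab))
        # Fully unlabeled compounds receive soft pseudo-labels for every task.
        Y_pseudo = predict(teacher, X_unlab)
        X_all = np.vstack([X_lab, X_unlab])
        Y_all = np.vstack([Y_fill, Y_pseudo])
        student = train_masked(X_all, Y_all, np.ones_like(Y_all),
                               noise_std=noise_std, seed=r + 1)
        teacher = student  # the student becomes the next teacher
    return teacher
```

Here `X_lab` would hold feature vectors for the labeled and partially labeled compounds (for example, a fixed-length embedding), `mask_lab` marks which task labels were actually measured, and `X_unlab` holds compounds with no labels at all.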
Keywords: Artificial intelligence; Chemical embedding; Deep neural network; Drug discovery; Drug metabolism; Drug toxicity; Drug-target interaction; Graph neural network; Self-supervised learning; Semi-supervised learning.
© 2022. The Author(s).
Conflict of interest statement
The authors declare no competing interests.