Traditional Machine and Deep Learning for Predicting Toxicity Endpoints
- PMID: 36615411
- PMCID: PMC9822478
- DOI: 10.3390/molecules28010217
Abstract
Molecular structure-property modeling is an increasingly important tool for identifying compounds with desired properties, because drug discovery and development are expensive and resource-intensive and suffer from toxicity-related attrition in late phases. Interest in applying deep learning techniques has recently increased considerably. This investigation compares traditional machine learning approaches, using physico-chemical descriptors or autoencoder-generated descriptors, with two descriptor-free deep learning architectures of the Bidirectional Encoder Representations from Transformers (BERT) type that operate on Simplified Molecular Input Line Entry System (SMILES) strings, using Mondrian aggregated conformal prediction as the overarching framework. The results for the two binary CATMoS datasets (non-toxic and very-toxic) show that on the former, which is almost equally balanced, all methods perform equally well, whereas on the latter, which has an 11-fold difference between the two classes, the MolBERT model, based on a large pre-trained network, performs somewhat better than the rest, with high efficiency for both classes (0.93-0.94) and high sensitivity, specificity and balanced accuracy (0.86-0.87). The descriptor-free, SMILES-based BERT architectures seem capable of producing well-balanced predictive models with defined applicability domains. This work also demonstrates that Mondrian conformal prediction handles the class imbalance problem gracefully, without over- or under-sampling, class weighting or cost-sensitive methods.
Keywords: BERT; CATMoS dataset; CDDD; RDKit; conformal prediction; random forest.
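The abstract credits Mondrian conformal prediction with handling class imbalance. A minimal sketch of that idea follows; it is not the author's implementation. The function names are hypothetical, the nonconformity scores are made-up illustration values (one common choice, assumed here, is 1 minus the predicted probability of the class), and median aggregation across calibration splits is likewise an assumption. The key point is that each candidate class is calibrated only against calibration compounds of that same class, so a small minority class is not swamped by the majority class.

```python
from statistics import median
from typing import Dict, List, Sequence, Set


def mondrian_p_values(
    cal_scores: Sequence[float],
    cal_labels: Sequence[int],
    test_scores: Dict[int, float],
) -> Dict[int, float]:
    """Class-conditional (Mondrian) conformal p-values for one test compound.

    cal_scores[i] is the nonconformity score of calibration compound i for
    its true class cal_labels[i]; test_scores[c] is the test compound's
    nonconformity score under candidate class c.
    """
    p_values = {}
    for label, test_score in test_scores.items():
        # Compare only against calibration compounds of the same class.
        same_class = [s for s, y in zip(cal_scores, cal_labels) if y == label]
        n_at_least = sum(1 for s in same_class if s >= test_score)
        p_values[label] = (n_at_least + 1) / (len(same_class) + 1)
    return p_values


def aggregate(runs: List[Dict[int, float]]) -> Dict[int, float]:
    """Aggregate p-values from several calibration splits (median; an assumption)."""
    return {label: median(run[label] for run in runs) for label in runs[0]}


def prediction_set(p_values: Dict[int, float], significance: float) -> Set[int]:
    """Keep every class whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Illustrative, imbalanced calibration set: six majority (0), two minority (1).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.6, 0.7]
cal_labels = [0, 0, 0, 0, 0, 0, 1, 1]
p = mondrian_p_values(cal_scores, cal_labels, {0: 0.35, 1: 0.65})
print(p)                       # class 0: 2/7, class 1: 2/3
print(prediction_set(p, 0.3))  # {1}: a single-label prediction
```

A prediction set with exactly one label is a single-label prediction; an empty set or a set containing both classes signals that the compound cannot be classified confidently at the chosen significance level.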
Conflict of interest statement
The author declares no conflict of interest.
Similar articles
- Positional embeddings and zero-shot learning using BERT for molecular-property prediction. J Cheminform. 2025 Feb 5;17(1):17. doi: 10.1186/s13321-025-00959-9. PMID: 39910649. Free PMC article.
- Descriptor-Free Deep Learning QSAR Model for the Fraction Unbound in Human Plasma. Mol Pharm. 2023 Oct 2;20(10):4984-4993. doi: 10.1021/acs.molpharmaceut.3c00129. Epub 2023 Sep 1. PMID: 37656906.
- Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration. Research (Wash D C). 2022 Dec 15;2022:0004. doi: 10.34133/research.0004. eCollection 2022. PMID: 39285949. Free PMC article.
- Data Integration Using Advances in Machine Learning in Drug Discovery and Molecular Biology. Methods Mol Biol. 2021;2190:167-184. doi: 10.1007/978-1-0716-0826-5_7. PMID: 32804365. Review.
- Advancing drug discovery with deep attention neural networks. Drug Discov Today. 2024 Aug;29(8):104067. doi: 10.1016/j.drudis.2024.104067. Epub 2024 Jun 24. PMID: 38925473. Review.
Cited by
- CPSign: conformal prediction for cheminformatics modeling. J Cheminform. 2024 Jun 28;16(1):75. doi: 10.1186/s13321-024-00870-9. PMID: 38943219. Free PMC article.