Traditional Machine and Deep Learning for Predicting Toxicity Endpoints

Ulf Norinder¹

PMID: 36615411
PMCID: PMC9822478
DOI: 10.3390/molecules28010217

Ulf Norinder. Molecules. 2022 Dec 26;28(1):217.

Affiliation

¹ Department of Computer and Systems Sciences, Stockholm University, 164 07 Kista, Sweden.

Abstract

Molecular structure-property modeling is an increasingly important tool for predicting compounds with desired properties, because drug discovery and development is expensive and resource-intensive and suffers from toxicity-related attrition in late phases. Lately, interest in applying deep learning techniques has increased considerably. This investigation compares traditional physico-chemical descriptor-based machine learning approaches and autoencoder-generated descriptors with two different descriptor-free, Simplified Molecular Input Line Entry System (SMILES)-based deep learning architectures of the Bidirectional Encoder Representations from Transformers (BERT) type, using the Mondrian aggregated conformal prediction method as the overarching framework. Two binary CATMoS datasets were studied: non-toxic and very-toxic. For the non-toxic dataset, which is almost equally balanced, all methods perform equally well. For the very-toxic dataset, which has an 11-fold difference between the two classes, the MolBERT model based on a large pre-trained network performs somewhat better than the rest, with high efficiency for both classes (0.93–0.94) as well as high values for sensitivity, specificity and balanced accuracy (0.86–0.87). The descriptor-free, SMILES-based deep learning BERT architectures seem capable of producing well-balanced predictive models with defined applicability domains. This work also demonstrates that the class imbalance problem is handled gracefully by Mondrian conformal prediction, without over- and/or under-sampling, weighting of classes or cost-sensitive methods.
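
As an illustration of why Mondrian (class-conditional) conformal prediction can cope with class imbalance, the following is a minimal sketch in plain Python and not the code used in the paper. It rests on several assumptions that the abstract does not state: the nonconformity score is taken as one minus the predicted probability of a class, p-values from several models are aggregated by their per-class median, efficiency is taken as the fraction of single-label prediction sets, and all function names and toy numbers are hypothetical. The key point is that each class is calibrated only against calibration compounds of that same class, so the minority class is not swamped by the majority class.

```python
from statistics import median


def mondrian_p_values(cal_scores, cal_labels, test_scores):
    """Class-conditional (Mondrian) conformal p-values for one test compound.

    cal_scores:  nonconformity score of each calibration compound for its true class
                 (assumed here to be 1 - predicted probability of that class).
    cal_labels:  true class of each calibration compound.
    test_scores: {class: nonconformity score of the test compound for that class}.
    """
    p_values = {}
    for label, alpha in test_scores.items():
        # Calibrate only against compounds that truly belong to this class.
        same_class = [s for s, y in zip(cal_scores, cal_labels) if y == label]
        n_at_least = sum(1 for s in same_class if s >= alpha)
        p_values[label] = (n_at_least + 1) / (len(same_class) + 1)
    return p_values


def aggregate(p_value_dicts):
    """Combine p-values from several models (assumption: per-class median)."""
    labels = p_value_dicts[0].keys()
    return {c: median(d[c] for d in p_value_dicts) for c in labels}


def prediction_set(p_values, significance):
    """Keep every class whose p-value exceeds the significance level."""
    return {c for c, p in p_values.items() if p > significance}


def efficiency(pred_sets):
    """Fraction of single-label prediction sets (assumed definition)."""
    return sum(1 for s in pred_sets if len(s) == 1) / len(pred_sets)


def class_error_rate(pred_sets, true_labels, label):
    """Fraction of compounds of class `label` whose set misses the true label."""
    members = [s for s, y in zip(pred_sets, true_labels) if y == label]
    return sum(1 for s in members if label not in s) / len(members)


# Toy, imbalanced calibration set: nine compounds of class 0, three of class 1.
cal_labels = [0] * 9 + [1] * 3
cal_scores = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9, 0.2, 0.5, 0.7]

# Hypothetical test compound with predicted probabilities 0.25 (class 0), 0.75 (class 1).
p = mondrian_p_values(cal_scores, cal_labels, {0: 0.75, 1: 0.25})
print(p, prediction_set(p, 0.2), prediction_set(p, 0.35))
```

Under these assumptions, a model would be considered valid for a class at a given significance level when `class_error_rate` for that class does not exceed that level; the exact validity and efficiency criteria used in the paper may differ.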

Keywords: BERT; CATMoS dataset; CDDD; RDKit; conformal prediction; random forest.

Conflict of interest statement

The author declares no conflict of interest.

Figures

**Figure 1**
A flow chart overview of the employed machine learning approaches. RDKit and CDDD = RDKit and CDDD descriptor calculation; tr. and val. set = training and validation set, respectively; eval. and calibr. set = evaluation and CP calibration set, respectively.

**Figure 2**
Number of valid evaluation set models (maximum 10) for each method type. Methods: cddd = RF/cddd 10 models, mg_bert = Molecular-graph-BERT/smiles 10 models, molbert = MolBERT/smiles 10 models, molbert_p = MolBERT/smiles 10 models with PubChem pre-trained model, rdkit = RF/rdkit 10 models, xxx_1 is the corresponding approach based on only 1 model.

**Figure 3**
Number of valid evaluation set models, at significance levels 0.1, 0.15 and 0.2, for each method (maximum 2). Methods: cddd = RF/cddd 10 models, mg_bert = Molecular-graph-BERT/smiles 10 models, molbert = MolBERT/smiles 10 models, molbert_p = MolBERT/smiles 10 models with PubChem pre-trained model, rdkit = RF/rdkit 10 models, xxx_1 is the corresponding approach based on only 1 model.

**Figure 4**
Evaluation set efficiency for class “1” for the 2 datasets (NT model upper row, VT model lower row), at significance levels 0.1–0.2, for each method. Class “1”: non-toxic class and very toxic class for the 2 datasets nt and vt, respectively. Methods: cddd = RF/cddd 10 models, mg_bert = Molecular-graph-BERT/smiles 10 models, molbert = MolBERT/smiles 10 models, molbert_p = MolBERT/smiles 10 models with PubChem pre-trained model, rdkit = RF/rdkit 10 models, xxx_1 is the corresponding approach based on only 1 model.

**Figure 5**
Evaluation set efficiency for class “0” for the 2 datasets (NT model upper row, VT model lower row), at significance levels 0.1–0.2, for each method. Class “0”: the other binary class for each dataset as compared to Figure 4. Methods: cddd = RF/cddd 10 models, mg_bert = Molecular-graph-BERT/smiles 10 models, molbert = MolBERT/smiles 10 models, molbert_p = MolBERT/smiles 10 models with PubChem pre-trained model, rdkit = RF/rdkit 10 models, xxx_1 is the corresponding approach based on only 1 model.

Grants and funding

2018/11/Swedish Foundation for Strategic Environmental Research
