Descriptor-Free Deep Learning QSAR Model for the Fraction Unbound in Human Plasma
- PMID: 37656906
- DOI: 10.1021/acs.molpharmaceut.3c00129
Abstract
Chemical-specific parameters are either measured in vitro or estimated using quantitative structure-activity relationship (QSAR) models. The existing body of QSAR work relies on extracting a set of descriptors or fingerprints, subset selection, and training a machine learning model. In this work, we used a state-of-the-art natural language processing model, Bidirectional Encoder Representations from Transformers (BERT), which allowed us to circumvent the need for calculation of these chemical descriptors. In this approach, simplified molecular-input line-entry system (SMILES) strings were embedded in a high-dimensional space using a two-stage training approach. The model was first pre-trained on a masked SMILES token task and then fine-tuned on a QSAR prediction task. The pre-training task learned meaningful high-dimensional embeddings based upon the relationships between the chemical tokens in the SMILES strings derived from the "in-stock" portion of the ZINC 15 dataset, a large dataset of commercially available chemicals. The fine-tuning task then perturbed the pre-trained embeddings to facilitate prediction of a specific QSAR endpoint of interest. The power of this model stems from the ability to reuse the pre-trained model for multiple different fine-tuning tasks, reducing the computational burden of developing multiple models for different endpoints. We used our framework to develop a predictive model for fraction unbound in human plasma (fu,p). This approach is flexible, requires minimal domain expertise, and can be generalized to other parameters of interest for rapid and accurate estimation of absorption, distribution, metabolism, excretion, and toxicity.
Keywords: BERT; QSAR; deep learning; fraction unbound; human plasma.
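The masked SMILES token pre-training task mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer regex, the 15% masking rate (borrowed from the original BERT recipe), the `[MASK]` symbol, and the function names `tokenize` and `mask_tokens` are all assumptions made for illustration. The sketch only builds the (masked input, target) pairs that a BERT-style model would be trained on; it does not train a model.

```python
import random
import re

# Assumed tokenizer: a regex that splits SMILES into bracket atoms,
# two-letter halogens, organic-subset atoms, bonds, branches and ring digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[=#\-\+\\/\(\)\.:~@\?>\*\$]|\d)"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is dropped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens and return (masked tokens, targets).

    `targets` maps each masked position to the original token, which is
    what the model is asked to predict during pre-training.
    The 15% rate is an assumption, not a value stated in the abstract.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, targets


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    tokens = tokenize(aspirin)
    masked, targets = mask_tokens(tokens)
    print(" ".join(masked))
    print(targets)
```

In the two-stage scheme the abstract describes, pairs like these would drive pre-training, after which the same encoder would be fine-tuned with a regression output for an endpoint such as fu,p. The fine-tuning step is omitted here because the abstract does not specify its architecture or loss.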