Commun Chem. 2024 Jun 12;7(1):134. doi: 10.1038/s42004-024-01220-4.

The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications


Scott H Snyder et al. Commun Chem.

Abstract

Recent advances in machine learning (ML) have led to newer model architectures, including transformers (large language models, LLMs) showing state-of-the-art results in text generation and image analysis, as well as few-shot learning (FSLC) models, which offer predictive power on extremely small datasets. These new architectures are promising, yet the 'no free lunch' theorem suggests that no single algorithm can outperform all others across all possible tasks. Here, we explore the capabilities of classical (SVR), FSLC, and transformer models (MolBART) over a range of dataset tasks and show a 'goldilocks zone' for each model type, in which dataset size and feature distribution (i.e. dataset "diversity") determine the optimal algorithm strategy. When datasets are small (<50 molecules), FSLC models tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are of sufficiently large size, classical models perform best, suggesting that the optimal model choice depends on the available dataset, its size, and its diversity. These findings may help to answer the perennial question of which ML algorithm to use when faced with a new dataset.
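The dataset-size regimes described above can be summarized as a simple decision rule. The sketch below is a minimal, hypothetical illustration of that heuristic: the molecule-count thresholds (<50, 50-240) come from the abstract, while the boolean `diverse` flag is a placeholder for the paper's CSFP-based diversity metric, which is not reproduced here.

```python
def choose_model(n_molecules: int, diverse: bool) -> str:
    """Pick a model family per the 'goldilocks zone' heuristic.

    Thresholds follow the abstract; the diversity criterion is a
    simplified stand-in for the paper's scaffold-diversity measure.
    """
    if n_molecules < 50:
        # Very small datasets: few-shot learners tend to win.
        return "few-shot learning (FSLC)"
    if n_molecules <= 240 and diverse:
        # Small-to-medium, diverse datasets: transformers tend to win.
        return "transformer (MolBART)"
    # Larger datasets: classical ML (e.g. SVR) tends to win.
    return "classical ML (SVR)"
```

This is only a first-pass triage rule; in practice one would still benchmark all three model families on a held-out split.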

PubMed Disclaimer

Conflict of interest statement

S.E. is the owner and all others are employees of Collaborations Pharmaceuticals, Inc.

Figures

Fig. 1
Fig. 1. Correlation plots between MolBART R2, SVR R2, the molecular diversity of the training set, and the number of molecules in the training dataset.
First, each of these metrics was calculated for each individual target dataset in the original set of 2401 ChEMBL datasets. Correlations were then calculated between each pair of metrics.
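The pairwise-correlation analysis in Fig. 1 can be sketched as follows. This is a toy illustration with random stand-in values for the four per-dataset metrics (MolBART R2, SVR R2, diversity, dataset size); the real analysis would use the values computed over the 2401 ChEMBL single-target datasets.

```python
import numpy as np

# Hypothetical stand-ins for the four per-dataset metrics; in the paper
# these would be one value per single-target ChEMBL dataset (n = 2401).
rng = np.random.default_rng(0)
n = 100
size = rng.integers(10, 500, n).astype(float)
diversity = rng.random(n)
svr_r2 = 0.3 + 0.001 * size + 0.1 * rng.random(n)
molbart_r2 = 0.2 + 0.3 * diversity + 0.1 * rng.random(n)

# Stack metrics row-wise and compute the 4x4 Pearson correlation matrix.
metrics = np.vstack([molbart_r2, svr_r2, diversity, size])
corr = np.corrcoef(metrics)
```

Each off-diagonal entry of `corr` corresponds to one panel of a correlation plot like Fig. 1; significance testing (e.g. via `scipy.stats.pearsonr`) would be added on top.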
Fig. 2
Fig. 2. Comparison of MolBART and SVR model R2 with dataset size and molecule diversity.
A R2 vs. the number of molecules for each of the 2401 training datasets from ChEMBL. Each point is a single-target dataset. B R2 vs. the diversity for each of the 2401 training datasets. Each point is a single-target dataset.
Fig. 3
Fig. 3. Example true vs. predicted -log(M) values for a small and large dataset for MolBART and SVR.
Perfect predictions would appear along the central gray line.
Fig. 4
Fig. 4. Sample CSFP plot curves of a non-diverse dataset (MAP Kinase MNK1) and a diverse dataset (Protein Kinase C).
These images illustrate how the curve changes as diversity decreases. A perfectly diverse dataset with all unique scaffolds would produce a straight diagonal line, while a dataset composed of only one scaffold would encapsulate the entire area of the plot.
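The shape of a cumulative scaffold frequency curve like those in Fig. 4 can be computed from scaffold assignments alone. The sketch below assumes scaffolds (e.g. Bemis-Murcko scaffolds from RDKit) have already been extracted and are represented as arbitrary hashable labels; it is an illustrative reconstruction, not the paper's exact CSFP implementation.

```python
from collections import Counter

def csfp_curve(scaffolds):
    """Cumulative scaffold frequency curve.

    Sort scaffolds by abundance (most common first) and return the
    cumulative fraction of molecules covered as each scaffold is added.
    All-unique scaffolds give a straight diagonal (maximal diversity);
    a single repeated scaffold jumps straight to 1.0 (minimal diversity).
    """
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    total = sum(counts)
    curve, running = [], 0
    for c in counts:
        running += c
        curve.append(running / total)
    return curve
```

The area under this curve (relative to the diagonal) then serves as a scalar diversity score for a dataset.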
Fig. 5
Fig. 5. Correlation plots between MolBART R2–SVR R2, molecular diversity per training set (diversity), and the number of molecules per training set.
Each point represents a single-target dataset. Correlations were significant between each paired feature (p < 0.05).
Fig. 6
Fig. 6. Developing a Test Set for MARK1 Inhibition.
A The MedChemExpress FDA-Approved and Pharmacopeial Drug Library (HY-L066) was screened for MARK1 inhibition using the Promega ADP-Glo Kinase Assay at a concentration of 385 µM. Compounds that exhibited >90% inhibition are shown in light blue. B The Z-factor for each of the nine plates used in the screen. C IC50 value determination for five novel MARK1 inhibitors using Z’-LYTE assay. Non-linear regression analysis (3-parameters) was performed in GraphPad Prism. Error bars are standard deviation.
Fig. 7
Fig. 7. t-SNE plots of the MACCS key fingerprints of the kinase datasets and the discovered MARK1 inhibitors.
A Chemical space overlap of the kinase datasets. B Chemical space overlap of the discovered MARK1 inhibitors vs. the MARK1 dataset and the remaining kinase datasets.
