Commun Chem. 2024 Jun 12;7(1):134. doi: 10.1038/s42004-024-01220-4.

The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications


Scott H Snyder et al. Commun Chem.

Abstract

Recent advances in machine learning (ML) have led to newer model architectures, including transformers (large language models, LLMs) showing state-of-the-art results in text generation and image analysis, as well as few-shot learning (FSLC) models, which offer predictive power on extremely small datasets. These new architectures are promising, yet the 'no free lunch' theorem suggests that no single algorithm can outperform all others across all possible tasks. Here, we explore the capabilities of classical (SVR), FSLC, and transformer models (MolBART) over a range of dataset tasks and show a 'goldilocks zone' for each model type, in which dataset size and feature distribution (i.e. dataset "diversity") determine the optimal algorithm strategy. When datasets are small (<50 molecules), FSLC models tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are of sufficiently large size, classical models perform best, suggesting that the optimal model choice depends on the available dataset, its size, and its diversity. These findings may help to answer the perennial question of which ML algorithm to use when faced with a new dataset.
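The dataset-size regimes described above can be summarized as a simple decision rule. The sketch below is a minimal, hypothetical illustration of that heuristic: the molecule-count thresholds (<50, 50-240) come from the abstract, while the boolean `diverse` flag is a placeholder for the paper's CSFP-based diversity metric, which is not reproduced here.

```python
def choose_model(n_molecules: int, diverse: bool) -> str:
    """Pick a model family per the 'goldilocks zone' heuristic.

    Thresholds follow the abstract; the diversity criterion is a
    simplified stand-in for the paper's scaffold-diversity measure.
    """
    if n_molecules < 50:
        # Very small datasets: few-shot learners tend to win.
        return "few-shot learning (FSLC)"
    if n_molecules <= 240 and diverse:
        # Small-to-medium, diverse datasets: transformers tend to win.
        return "transformer (MolBART)"
    # Larger datasets: classical ML (e.g. SVR) tends to win.
    return "classical ML (SVR)"
```

This is only a first-pass triage rule; in practice one would still benchmark all three model families on a held-out split.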

PubMed Disclaimer

Conflict of interest statement

S.E. is the owner and all others are employees of Collaborations Pharmaceuticals, Inc.

Figures

Fig. 1
Fig. 1. Correlation plots between MolBART R2, SVR R2, the molecular diversity of the training set, and the number of molecules in the training dataset.
First, each of these metrics was calculated for each individual target dataset in the original set of 2401 ChEMBL datasets. Correlations were then calculated between each pair of metrics.
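The pairwise-correlation analysis in Fig. 1 can be sketched as follows. This is a toy illustration with random stand-in values for the four per-dataset metrics (MolBART R2, SVR R2, diversity, dataset size); the real analysis would use the values computed over the 2401 ChEMBL single-target datasets.

```python
import numpy as np

# Hypothetical stand-ins for the four per-dataset metrics; in the paper
# these would be one value per single-target ChEMBL dataset (n = 2401).
rng = np.random.default_rng(0)
n = 100
size = rng.integers(10, 500, n).astype(float)
diversity = rng.random(n)
svr_r2 = 0.3 + 0.001 * size + 0.1 * rng.random(n)
molbart_r2 = 0.2 + 0.3 * diversity + 0.1 * rng.random(n)

# Stack metrics row-wise and compute the 4x4 Pearson correlation matrix.
metrics = np.vstack([molbart_r2, svr_r2, diversity, size])
corr = np.corrcoef(metrics)
```

Each off-diagonal entry of `corr` corresponds to one panel of a correlation plot like Fig. 1; significance testing (e.g. via `scipy.stats.pearsonr`) would be added on top.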
Fig. 2
Fig. 2. Comparison of MolBART and SVR model R2 with dataset size and molecule diversity.
A R2 vs. the number of molecules for each of the 2401 training datasets from ChEMBL. Each point is a single-target dataset. B R2 vs. the diversity for each of the 2401 training datasets. Each point is a single-target dataset.
Fig. 3
Fig. 3. Example true vs. predicted -log(M) values for a small and large dataset for MolBART and SVR.
Perfect predictions would appear along the central gray line.
Fig. 4
Fig. 4. Sample CSFP plot curves of a non-diverse dataset (MAP Kinase MNK1) and a diverse dataset (Protein Kinase C).
These images illustrate how the curve changes as diversity decreases. A perfectly diverse dataset with all unique scaffolds would produce a straight diagonal line, while a dataset composed of only one scaffold would encapsulate the entire area of the plot.
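The shape of a cumulative scaffold frequency curve like those in Fig. 4 can be computed from scaffold assignments alone. The sketch below assumes scaffolds (e.g. Bemis-Murcko scaffolds from RDKit) have already been extracted and are represented as arbitrary hashable labels; it is an illustrative reconstruction, not the paper's exact CSFP implementation.

```python
from collections import Counter

def csfp_curve(scaffolds):
    """Cumulative scaffold frequency curve.

    Sort scaffolds by abundance (most common first) and return the
    cumulative fraction of molecules covered as each scaffold is added.
    All-unique scaffolds give a straight diagonal (maximal diversity);
    a single repeated scaffold jumps straight to 1.0 (minimal diversity).
    """
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    total = sum(counts)
    curve, running = [], 0
    for c in counts:
        running += c
        curve.append(running / total)
    return curve
```

The area under this curve (relative to the diagonal) then serves as a scalar diversity score for a dataset.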
Fig. 5
Fig. 5. Correlation plots between MolBART R2–SVR R2, molecular diversity per training set (diversity), and the number of molecules per training set.
Each point represents a single-target dataset. Correlations were significant between each paired feature (p < 0.05).
Fig. 6
Fig. 6. Developing a Test Set for MARK1 Inhibition.
A The MedChemExpress FDA-Approved and Pharmacopeial Drug Library (HY-L066) was screened for MARK1 inhibition using the Promega ADP-Glo Kinase Assay at a concentration of 385 µM. Compounds that exhibited >90% inhibition are shown in light blue. B The Z-factor for each of the nine plates used in the screen. C IC50 value determination for five novel MARK1 inhibitors using Z’-LYTE assay. Non-linear regression analysis (3-parameters) was performed in GraphPad Prism. Error bars are standard deviation.
Fig. 7
Fig. 7. t-SNE plots of the MACCS key fingerprints of the kinase datasets and the discovered MARK1 inhibitors.
A Chemical space overlap of the kinase datasets. B Chemical space overlap of the discovered MARK1 inhibitors vs. the MARK1 dataset and the remaining kinase datasets.
