Genome Biol. 2025 Apr 18;26(1):101.
doi: 10.1186/s13059-025-03574-x.

Zero-shot evaluation reveals limitations of single-cell foundation models


Kasia Z Kedzierska et al. Genome Biol. .

Abstract

Foundation models such as scGPT and Geneformer have not been rigorously evaluated in a setting where they are used without any further training (i.e., zero-shot). Understanding the performance of models in zero-shot settings is critical to applications that preclude fine-tuning, such as discovery settings where labels are unknown. Our evaluation of the zero-shot performance of Geneformer and scGPT suggests that, in some cases, these models may face reliability challenges and could be outperformed by simpler methods. Our findings underscore the importance of zero-shot evaluations in the development and deployment of foundation models in single-cell research.

Keywords: Foundation models; Machine learning; Single-cell.


Conflict of interest statement

Declarations. Ethical approval and consent to participate: Not applicable. Competing interests: A.X.L., L.C., and A.P.A. are employees of and hold equity in Microsoft.

Figures

Fig. 1
Evaluation of the cell embedding space generated by the models. A Overview of the evaluation setup. We compare Geneformer and scGPT to scVI, Harmony, and the selection of highly variable genes (HVG) on five diverse datasets. B Average BIO score for HVG and for embeddings from Harmony, scVI, scGPT, and Geneformer. C, D UMAP projections of the Pancreas (16k) dataset in the cell embedding space generated by each model; cells are color-coded by cell type (C) and batch (D). E Average batch score for HVG and for embeddings from Harmony, scVI, scGPT, and Geneformer. The dashed line in B and E marks the median across datasets.
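A bio-conservation score like the one averaged in panel B rewards embeddings in which cells of the same type cluster together. The sketch below is a minimal, hypothetical proxy using a rescaled silhouette coefficient on cell-type labels; the paper's actual BIO score aggregates several scib-style metrics, which this does not reproduce.

```python
# Hypothetical sketch: score how well an embedding separates cell types,
# in the spirit of a bio-conservation ("BIO") score.
# Assumption: a silhouette-based proxy, NOT the exact metrics used in the paper.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

def bio_score(embedding: np.ndarray, cell_types: np.ndarray) -> float:
    """Rescale the silhouette coefficient on cell-type labels to [0, 1]."""
    return (silhouette_score(embedding, cell_types) + 1) / 2

# Toy data: 300 cells, 2 cell types, 16-dimensional embedding.
labels = np.repeat([0, 1], 150)
emb = rng.normal(size=(300, 16)) + labels[:, None] * 3.0  # well-separated types

print(round(bio_score(emb, labels), 2))
```

The same function applied to each method's embedding (HVG, Harmony, scVI, scGPT, Geneformer) would give a directly comparable per-dataset score, which is the shape of the comparison in panel B.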
Fig. 2
Performance comparison of scGPT and Geneformer in gene expression reconstruction. A-C Reconstruction of expression in the Immune (330k) dataset: A scGPT gene expression prediction (GEP) under the masked language modeling (MLM) objective. B scGPT gene expression prediction from cell embeddings (GEPC). C Geneformer MLM output: predicted expression ranking (y-axis) versus true input expression ranking (x-axis). D Mean squared error (MSE) comparison for the scGPT objectives. Points and solid lines show the mean and standard deviation range; the dashed line shows the median MSE for mean-based reconstruction. E Pearson's correlation between input and predicted expression rankings, for Geneformer and for average gene rankings. Points and solid lines show the mean and standard deviation range; the dashed line shows the median correlation for the average ranking.

