Genome Biol. 2025 Apr 18;26(1):101.
doi: 10.1186/s13059-025-03574-x.

Zero-shot evaluation reveals limitations of single-cell foundation models


Kasia Z Kedzierska et al. Genome Biol. .

Abstract

Foundation models such as scGPT and Geneformer have not been rigorously evaluated in a setting where they are used without any further training (i.e., zero-shot). Understanding the performance of models in zero-shot settings is critical to applications that preclude fine-tuning, such as discovery settings where labels are unknown. Our evaluation of the zero-shot performance of Geneformer and scGPT suggests that, in some cases, these models may face reliability challenges and could be outperformed by simpler methods. Our findings underscore the importance of zero-shot evaluations in the development and deployment of foundation models in single-cell research.

Keywords: Foundation models; Machine learning; Single-cell.


Conflict of interest statement

Declarations. Ethical approval and consent to participate: Not applicable. Competing interests: A.X.L., L.C., and A.P.A. are employees of and hold equity in Microsoft.

Figures

Fig. 1
Evaluation of the cell embedding space generated by the models. A Overview of the evaluation setup. We compare Geneformer and scGPT to scVI, Harmony, and the selection of highly variable genes (HVG) on five diverse datasets. B Average BIO score for HVG and for embeddings from Harmony, scVI, scGPT, and Geneformer. C, D UMAP projections of the Pancreas (16k) dataset in the cell embedding space generated by each model; cells are color-coded by cell type (C) and batch (D). E Average batch score for HVG and for embeddings from Harmony, scVI, scGPT, and Geneformer. The dashed line in B and E marks the median across datasets.
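A bio-conservation score like the one averaged in panel B rewards embeddings in which cells of the same type cluster together. The sketch below is a minimal, hypothetical proxy using a rescaled silhouette coefficient on cell-type labels; the paper's actual BIO score aggregates several scib-style metrics, which this does not reproduce.

```python
# Hypothetical sketch: score how well an embedding separates cell types,
# in the spirit of a bio-conservation ("BIO") score.
# Assumption: a silhouette-based proxy, NOT the exact metrics used in the paper.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

def bio_score(embedding: np.ndarray, cell_types: np.ndarray) -> float:
    """Rescale the silhouette coefficient on cell-type labels to [0, 1]."""
    return (silhouette_score(embedding, cell_types) + 1) / 2

# Toy data: 300 cells, 2 cell types, 16-dimensional embedding.
labels = np.repeat([0, 1], 150)
emb = rng.normal(size=(300, 16)) + labels[:, None] * 3.0  # well-separated types

print(round(bio_score(emb, labels), 2))
```

The same function applied to each method's embedding (HVG, Harmony, scVI, scGPT, Geneformer) would give a directly comparable per-dataset score, which is the shape of the comparison in panel B.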
Fig. 2
Performance comparison of scGPT and Geneformer in gene expression reconstruction. A-C Reconstruction of expression in the Immune (330k) dataset: A scGPT gene expression prediction (GEP) under the masked language modeling (MLM) objective. B scGPT gene expression prediction from cell embeddings (GEPC). C Geneformer MLM output: predicted expression ranking (y-axis) versus true input expression ranking (x-axis). D Mean squared error (MSE) comparison for the scGPT objectives. Points and solid lines show the mean and standard deviation range; the dashed line shows the median MSE for mean-based reconstruction. E Pearson's correlation between input and predicted expression rankings, for Geneformer and for average gene rankings. Points and solid lines show the mean and standard deviation range; the dashed line shows the median correlation for the average ranking.

