Nat Commun. 2025 Oct 28;16(1):9511.
doi: 10.1038/s41467-025-64511-x.

Benchmarking cell type and gene set annotation by large language models with AnnDictionary

George Crowley et al.

Abstract

We develop an open-source package called AnnDictionary to facilitate the parallel, independent analysis of multiple anndata objects. AnnDictionary is built on top of LangChain and AnnData and supports all common large language model (LLM) providers. It requires only one line of code to configure or switch the LLM backend, and it contains numerous multithreading optimizations to support the analysis of many, and large, anndata objects. We use AnnDictionary to perform the first benchmarking study of all major LLMs at de novo cell type annotation. LLMs vary greatly in absolute agreement with manual annotation depending on model size, and inter-LLM agreement also varies with model size. We find LLM annotation of most major cell types to be more than 80-90% accurate, and we will maintain a leaderboard of LLM cell type annotation. Furthermore, we benchmark these LLMs at functional annotation of gene sets and find that Claude 3.5 Sonnet recovers close matches of functional gene set annotations in over 80% of test sets.
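
The one-line backend configuration described above might look like the following minimal sketch; the import alias, function name, and parameters are assumptions for illustration, not necessarily the package's documented API.

# Minimal sketch of a one-line LLM backend configuration for AnnDictionary.
# NOTE: the import alias, function name, and parameters below are assumptions
# for illustration, not necessarily the package's documented API.
import anndict as adt  # assumed import alias for AnnDictionary

# Single call to select the LLM provider and model used for downstream
# cell type and gene set annotation (any LangChain-supported provider).
adt.configure_llm_backend(
    provider="anthropic",
    model="claude-3-5-sonnet-20240620",
    api_key="YOUR_API_KEY",
)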

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of AnnDictionary and sample LLM cell type annotations.
A Overview of AnnDictionary—a Python package built on top of LangChain and AnnData, with the goal of independently processing multiple anndata in parallel. B Example LLM annotations of cell types and coarse manual annotations for all cells detected in the blood of Tabula Sapiens v2. Colored by cell type annotation.
Fig. 2
Fig. 2. LLM cell type annotation performance.
LLM cell type annotation quality compared with manual annotation, rated by an LLM at three levels: (1) perfect, (2) partial, and (3) non-matching, and at two resolutions: (A) by cell and (B) by cell type. Inter-rater reliability, measured as pairwise kappa between LLMs: (C) mean and (D) standard deviation. All metrics are shown as mean and standard deviation across five replicates. Source data are provided as a Source Data file.
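
The pairwise-kappa summary in panels C and D could be reproduced along these lines; the data layout and helper name are illustrative assumptions, with Cohen's kappa from scikit-learn standing in for the paper's exact kappa variant.

# Illustrative pairwise inter-rater reliability between LLM annotators,
# summarized as mean and standard deviation (cf. panels C and D).
# The dict layout and function name are assumptions for illustration.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score


def pairwise_kappa(annotations: dict[str, list[str]]) -> tuple[float, float]:
    """annotations maps LLM name -> per-cell labels (same cell ordering)."""
    kappas = [
        cohen_kappa_score(annotations[a], annotations[b])
        for a, b in combinations(annotations, 2)
    ]
    return float(np.mean(kappas)), float(np.std(kappas))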
Fig. 3
Fig. 3. LLM annotation performance for the most abundant cell types.
A Agreement with manual annotation of the top-performing LLMs for the ten largest cell types by population size in Tabula Sapiens v2. As in Fig. 2, agreement was assessed at two levels: binary (yes/no, top) and perfect match (bottom), and measured as mean and standard deviation across five replicates. Source data are provided as a Source Data file. For the two large cell types that disagreed most with manual annotation: LLM annotations for cells manually annotated as (B) basal cells and (D) stromal cells of the ovary; and gene module scores for marker genes of the manually annotated cell type vs. marker genes for the mode LLM annotation: (C) basal cell and epithelial cell scores and (E) stromal cell and granulosa cell scores.
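
Gene module scores of the kind shown in panels C and E are commonly computed with Scanpy's score_genes; the sketch below assumes placeholder marker gene lists and a preloaded, normalized AnnData object, not the study's actual marker sets.

# Illustrative gene module scoring with Scanpy (cf. panels C and E).
# The marker lists are placeholders, not the study's actual marker sets.
import scanpy as sc
from anndata import AnnData


def add_module_scores(adata: AnnData) -> None:
    basal_markers = ["KRT5", "KRT14", "TP63"]        # placeholder basal markers
    epithelial_markers = ["EPCAM", "KRT8", "KRT18"]  # placeholder epithelial markers
    # score_genes adds a per-cell score column to adata.obs for each module.
    sc.tl.score_genes(adata, basal_markers, score_name="basal_score")
    sc.tl.score_genes(adata, epithelial_markers, score_name="epithelial_score")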
Fig. 4
Fig. 4. Qualitative assessment of annotation confidence.
A Inter-rater agreement within the top 4 performing LLMs vs. agreement with manual annotation for each manual cell type annotation, with marginal kernel density estimates stratified by tertile of cell type population size. Red, yellow, and green represent the bottom, middle, and top tertiles of cell type population size, respectively. B Same set of axes as (A), with dot sizes scaled by their respective cell type population sizes, and with kernel density estimates scaled by population size as well. The manually drawn ellipses outline two regions of interest: (A) the cell types with the highest inter-rater agreement and lowest agreement with manual annotation, which are the subject of Fig. 5, and (B) the cell types with the highest inter-rater agreement and highest agreement with manual annotation, which includes the most abundant cell types discussed earlier.
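
A rough sketch of the tertile stratification behind the marginal density estimates, assuming a per-cell-type summary table with illustrative column names (not the paper's source data schema):

# Illustrative tertile stratification for the marginal density estimates in (A);
# column names are assumptions, not the paper's source data schema.
import pandas as pd
import seaborn as sns


def plot_agreement_by_tertile(df: pd.DataFrame):
    """df: one row per cell type with columns 'inter_llm_agreement',
    'manual_agreement', and 'n_cells' (population size)."""
    df = df.copy()
    df["size_tertile"] = pd.qcut(
        df["n_cells"], q=3, labels=["bottom", "middle", "top"]
    )
    # Scatter of inter-LLM vs. manual agreement with per-tertile marginal KDEs.
    return sns.jointplot(
        data=df,
        x="inter_llm_agreement",
        y="manual_agreement",
        hue="size_tertile",
    )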
Fig. 5
Fig. 5. Cell types with high inter-LLM agreement and low manual agreement.
A For the 10 cell types closest to the top-left corner of the scatterplot in Fig. 4A, a confusion matrix of top-performing LLM annotations and corresponding manual annotations, with a red box around the largest cell type by abundance present in this group (phagocytes). The color bar represents the proportion of cells from each category of manual annotation that are in each category of LLM annotation. Thus, each row sums to 1. B Macrophage, monocyte, and dendritic cell module scores derived using canonical marker genes for cells manually annotated as phagocytes. C UMAP visualization of the module scores in (B).
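
The row-normalized confusion matrix in panel A corresponds to proportions computed per manual label; a minimal sketch with pandas, assuming aligned per-cell label vectors with illustrative names:

# Illustrative row-normalized confusion matrix (cf. panel A); each row gives the
# proportion of cells from one manual annotation assigned to each LLM annotation.
# Input names are assumptions: aligned per-cell label sequences.
import pandas as pd


def row_normalized_confusion(manual_labels, llm_labels) -> pd.DataFrame:
    return pd.crosstab(
        pd.Series(manual_labels, name="manual_annotation"),
        pd.Series(llm_labels, name="llm_annotation"),
        normalize="index",  # rows sum to 1, as described in the caption
    )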
