Nat Commun. 2025 Oct 28;16(1):9511.

doi: 10.1038/s41467-025-64511-x.

Benchmarking cell type and gene set annotation by large language models with AnnDictionary

George Crowley¹; Tabula Sapiens Consortium; Stephen R Quake²,³,⁴

Collaborators

Tabula Sapiens Consortium:
Robert C Jones, Mark Krasnow, Angela Oliveira Pisco, Julia Salzman, Nir Yosef, Siyu He, Madhav Mantri, Jessie Aguirre, Ron Garner, Sal Guerrero, William Harper, Resham Irfan, Sophia Mahfouz, Ravi Ponnusamy, Bhavani A Sanagavarapu, Ahmad Salehi, Ivan Sampson, Chloe Tang, Alan G Cheng, James M Gardner, Burnett Kelly, Thurman Slone, Zifa Wang, Anika Choudhury, Sheela Crasta, Chen Dong, Marcus L Forst, Douglas E Henze, Jaeyoon Lee, Maurizio Morri, Serena Y Tan, Sevahn K Vorperian, Lynn Yang, Marcela Alcántara-Hernádez, Julian Berg, Dhruv Bhatt, Sara Billings, Andrès Gottfried-Blackmore, Jamie Bozeman, Simon Bucher, Elisa Caffrey, Amber Casillas, Rebecca Chen, Matthew Choi, Rebecca N Culver, Ivana Cvijovic, Ke Ding, Hala Shakib Dhowre, Hua Dong, Kenneth Donaville, Lauren Duan, Xiaochen Fan, Mariko H Foecke, Francisco X Galdos, Eliza A Gaylord, Karen Gonzales, William R Goodyer, Michelle Griffin, Yuchao Gu, Shuo Han, Jun Yan He, Paul Heinrich, Rebeca Arroyo Hornero, Keliana Hui, Juan C Irwin, SoRi Jang, Annie Jensen, Saswati Karmakar, Jengmin Kang, Hailey Kang, Soochi Kim, Stewart J Kim, William Kong, Mallory A Laboulaye, Daniel Lee, Gyehyun Lee, Elise Lelou, Anping Li, Baoxiang Li, Wan-Jin Lu, Hayley Raquer-McKay, Elvira Mennillo, Lindsay Moore, Elena Montauti, Karim Mrouj, Shravani Mukherjee, Patrick Neuhöfer, Saphia Nguyen, Honor Paine, Jennifer B Parker, Julia Pham, Kiet T Phong, Pratima Prabala, Zhen Qi, Joshua Quintanilla, Iulia Rusu, Ali Reza Rais Sadati, Bronwyn Scott, David Seong, Hosu Sin, Hanbing Song, Bikem Soyur, Sean Spencer, Varun R Subramaniam, Michael Swift, Aditi Swarup, Greg Szot, Aris Taychameekiatchai, Emily Trimm, Stefan Veizades, Sivakamasundari Vijayakumar, Kim Chi Vo, Tian Wang, Timothy Wu, Yinghua Xie, William Yue, Zue Zhang, Angela Detweiler, Honey Mekonen, Norma F Neff, Sheryl Paul, Amanda Seng, Jia Yan, Deana Rae Crystal Colburg, Balint Laszlo Forgo, Luca Ghita, Frank McCarthy, Aditi Agrawal, Alina Isakova, Kavita Murthy, Alexandra Psaltis, 
Wenfei Sun, Kyle Awayan, Pierre Boyeau, Robrecht Cannoodt, Leah Dorman, Samuel D'Souza, Can Ergen, Justin Hong, Harper Hua, Erin McGeever, Antoine de Morree, Luise A Seeker, Alexander J Tarashansky, Astrid Gillich, Taha A Jan, Angela Ling, Abhishek Murti, Nikita Sajai, Ryan M Samuel, Juliane Winkler, Steven E Artandi, Philip A Beachy, Mike F Clarke, Zev Gartner, Linda C Giudice, Franklin W Huang, Juliana Idoyaga, Michael G Kattah, Christin S Kuo, Diana J Laird, Michael T Longaker, Patricia Nguyen, David Y Oh, Thomas A Rando, Kristy Red-Horse, Bruce Wang, Albert Y Wu, Sean M Wu, Bo Yu, James Zou

Affiliations

¹ Department of Bioengineering, Stanford University, Stanford, California, USA.
² Department of Bioengineering, Stanford University, Stanford, California, USA. steve@quake-lab.org.
³ Department of Applied Physics, Stanford University, Stanford, CA, USA. steve@quake-lab.org.
⁴ Chan Zuckerberg Initiative, Redwood City, CA, USA. steve@quake-lab.org.

PMID: 41152246
PMCID: PMC12569162
DOI: 10.1038/s41467-025-64511-x

George Crowley et al. Nat Commun. 2025.

. 2025 Oct 28;16(1):9511.

doi: 10.1038/s41467-025-64511-x.

Authors

George Crowley¹; Tabula Sapiens Consortium; Stephen R Quake^{2

3

4}

Collaborators

Tabula Sapiens Consortium:
Robert C Jones, Mark Krasnow, Angela Oliveira Pisco, Julia Salzman, Nir Yosef, Siyu He, Madhav Mantri, Jessie Aguirre, Ron Garner, Sal Guerrero, William Harper, Resham Irfan, Sophia Mahfouz, Ravi Ponnusamy, Bhavani A Sanagavarapu, Ahmad Salehi, Ivan Sampson, Chloe Tang, Alan G Cheng, James M Gardner, Burnett Kelly, Thurman Slone, Zifa Wang, Anika Choudhury, Sheela Crasta, Chen Dong, Marcus L Forst, Douglas E Henze, Jaeyoon Lee, Maurizio Morri, Serena Y Tan, Sevahn K Vorperian, Lynn Yang, Marcela Alcántara-Hernádez, Julian Berg, Dhruv Bhatt, Sara Billings, Andrès Gottfried-Blackmore, Jamie Bozeman, Simon Bucher, Elisa Caffrey, Amber Casillas, Rebecca Chen, Matthew Choi, Rebecca N Culver, Ivana Cvijovic, Ke Ding, Hala Shakib Dhowre, Hua Dong, Kenneth Donaville, Lauren Duan, Xiaochen Fan, Mariko H Foecke, Francisco X Galdos, Eliza A Gaylord, Karen Gonzales, William R Goodyer, Michelle Griffin, Yuchao Gu, Shuo Han, Jun Yan He, Paul Heinrich, Rebeca Arroyo Hornero, Keliana Hui, Juan C Irwin, SoRi Jang, Annie Jensen, Saswati Karmakar, Jengmin Kang, Hailey Kang, Soochi Kim, Stewart J Kim, William Kong, Mallory A Laboulaye, Daniel Lee, Gyehyun Lee, Elise Lelou, Anping Li, Baoxiang Li, Wan-Jin Lu, Hayley Raquer-McKay, Elvira Mennillo, Lindsay Moore, Elena Montauti, Karim Mrouj, Shravani Mukherjee, Patrick Neuhöfer, Saphia Nguyen, Honor Paine, Jennifer B Parker, Julia Pham, Kiet T Phong, Pratima Prabala, Zhen Qi, Joshua Quintanilla, Iulia Rusu, Ali Reza Rais Sadati, Bronwyn Scott, David Seong, Hosu Sin, Hanbing Song, Bikem Soyur, Sean Spencer, Varun R Subramaniam, Michael Swift, Aditi Swarup, Greg Szot, Aris Taychameekiatchai, Emily Trimm, Stefan Veizades, Sivakamasundari Vijayakumar, Kim Chi Vo, Tian Wang, Timothy Wu, Yinghua Xie, William Yue, Zue Zhang, Angela Detweiler, Honey Mekonen, Norma F Neff, Sheryl Paul, Amanda Seng, Jia Yan, Deana Rae Crystal Colburg, Balint Laszlo Forgo, Luca Ghita, Frank McCarthy, Aditi Agrawal, Alina Isakova, Kavita Murthy, Alexandra Psaltis, 
Wenfei Sun, Kyle Awayan, Pierre Boyeau, Robrecht Cannoodt, Leah Dorman, Samuel D'Souza, Can Ergen, Justin Hong, Harper Hua, Erin McGeever, Antoine de Morree, Luise A Seeker, Alexander J Tarashansky, Astrid Gillich, Taha A Jan, Angela Ling, Abhishek Murti, Nikita Sajai, Ryan M Samuel, Juliane Winkler, Steven E Artandi, Philip A Beachy, Mike F Clarke, Zev Gartner, Linda C Giudice, Franklin W Huang, Juliana Idoyaga, Michael G Kattah, Christin S Kuo, Diana J Laird, Michael T Longaker, Patricia Nguyen, David Y Oh, Thomas A Rando, Kristy Red-Horse, Bruce Wang, Albert Y Wu, Sean M Wu, Bo Yu, James Zou

Affiliations

¹ Department of Bioengineering, Stanford University, Stanford, California, USA.
² Department of Bioengineering, Stanford University, Stanford, California, USA. steve@quake-lab.org.
³ Department of Applied Physics, Stanford University, Stanford, CA, USA. steve@quake-lab.org.
⁴ Chan Zuckerberg Initiative, Redwood City, CA, USA. steve@quake-lab.org.

PMID: 41152246
PMCID: PMC12569162
DOI: 10.1038/s41467-025-64511-x

Abstract

We develop an open-source package called AnnDictionary to facilitate the parallel, independent analysis of multiple anndata. AnnDictionary is built on top of LangChain and AnnData and supports all common large language model (LLM) providers. It requires only one line of code to configure or switch the LLM backend, and it contains numerous multithreading optimizations to support the analysis of many anndata and large anndata. We use AnnDictionary to perform the first benchmarking study of all major LLMs at de novo cell-type annotation. Absolute agreement with manual annotation varies greatly with model size, as does inter-LLM agreement. We find LLM annotation of most major cell types to be more than 80-90% accurate, and we will maintain a leaderboard of LLM cell type annotation. Furthermore, we benchmark these LLMs at functional annotation of gene sets and find that Claude 3.5 Sonnet recovers close matches of functional gene set annotations in over 80% of test sets.
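The idea of processing several datasets independently and in parallel can be illustrated with a minimal sketch. This is not AnnDictionary's API: the helper `apply_to_each`, the `datasets` dictionary, and the plain Python lists standing in for anndata objects are all hypothetical, chosen only to show the pattern of mapping one function over a keyed collection with a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_to_each(datasets, func, max_workers=4):
    """Run func on every dataset independently; results keep the input keys.

    Hypothetical helper for illustration, not part of AnnDictionary.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per dataset so each is processed on its own.
        futures = {key: pool.submit(func, data) for key, data in datasets.items()}
        # Collect results under the same keys; exceptions re-raise here.
        return {key: future.result() for key, future in futures.items()}


# Hypothetical stand-ins for per-tissue datasets: lists of per-cell total counts.
datasets = {
    "blood": [120, 340, 560],
    "lung": [80, 90],
}


def count_cells(cells):
    """Toy per-dataset analysis step: number of cells."""
    return len(cells)


cell_counts = apply_to_each(datasets, count_cells)
print(cell_counts)
```

Keying results by the same labels as the input keeps each dataset's output traceable to its source, which is the main reason to prefer a dictionary over a bare list here.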

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Overview of AnnDictionary and sample LLM cell type annotations.**
A Overview of AnnDictionary—a Python package built on top of LangChain and AnnData, with the goal of independently processing multiple anndata in parallel. B Example LLM annotations of cell types and coarse manual annotations for all cells detected in the blood of Tabula Sapiens v2. Colored by cell type annotation.

**Fig. 2. LLM Cell type annotation performance.**
LLM cell type annotation quality as compared to manual annotation, rated by an LLM at three levels: 1) perfect, 2) partial, and 3) non-matching, and at two resolutions: (A) by cell and (B) by cell type. Inter-rater reliability measured as pairwise kappa between each pair of LLMs: (C) mean and (D) standard deviation. All metrics are shown as mean and standard deviation across five replicates. Source data are provided as a Source Data file.
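As a rough illustration of pairwise inter-rater reliability, the sketch below computes a kappa value for every pair of annotators. It assumes the statistic is Cohen's kappa, which the caption does not specify, and the model names and labels are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four clusters by three models.
raters = {
    "model_x": ["T cell", "T cell", "B cell", "B cell"],
    "model_y": ["T cell", "T cell", "B cell", "T cell"],
    "model_z": ["T cell", "T cell", "B cell", "B cell"],
}

pairwise_kappa = {
    (a, b): cohen_kappa(raters[a], raters[b])
    for a, b in combinations(sorted(raters), 2)
}
print(pairwise_kappa)
```

Here `model_x` and `model_z` agree on every cluster and score 1.0, while each disagrees with `model_y` on one of four clusters and scores 0.5 after correcting for chance.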

**Fig. 3. LLM Annotation performance for the most abundant cell types.**
A Agreement with manual annotation of top-performing LLMs for the ten largest cell types by population size in Tabula Sapiens v2. As in Fig. 2, agreement was assessed at two levels: binary (yes/no, top) and perfect match (bottom), and measured as mean and standard deviation across five replicates. Source data are provided as a Source Data file. For the two large cell types that disagreed with manual annotation the most: LLM annotations for cells manually annotated as (B) basal cells and (D) stromal cells of the ovary; and gene module scores for marker genes of the manually annotated cell type vs. marker genes for the mode LLM annotation: (C) basal cell and epithelial cell scores and (E) stromal cell and granulosa cell scores.

**Fig. 4. Qualitative assessment of annotation confidence.**
A Inter-rater agreement within the top 4 performing LLMs vs. agreement with manual annotation for each manual cell type annotation, with marginal kernel density estimates stratified by tertile of cell type population size. Red, yellow, and green represent the bottom, middle, and top tertiles of cell type by population size, respectively. B Same set of axes as (A), with dot sizes scaled by their respective cell type population size, and with kernel density estimates scaled by population size as well. The manually drawn ellipses outline two regions of interest: (A) the cell types with the highest inter-rater agreement and lowest agreement with manual annotation, which are the subject of Fig. 5, and (B) the cell types with the highest inter-rater agreement and highest agreement with manual annotation, which include the most abundant cell types discussed earlier.

**Fig. 5. Cell types with high inter-LLM agreement and low manual agreement.**
A For the 10 cell types closest to the top-left corner of the scatterplot in Fig. 4A, a confusion matrix of top-performing LLM annotations and corresponding manual annotations, with a red box around the largest cell type by abundance present in this group (phagocytes). The color bar represents the proportion of cells from each category of manual annotation that are in each category of LLM annotation. Thus, each row sums to 1. B Macrophage, monocyte, and dendritic cell module scores derived using canonical marker genes for cells manually annotated as phagocytes. C UMAP visualization of the module scores in (B).
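The row normalization described for the confusion matrix in panel A can be sketched as follows. The function name and the example labels are hypothetical; the point is only that each manual category's counts are divided by that category's total, so every row sums to 1.

```python
from collections import Counter, defaultdict


def row_normalized_confusion(manual, predicted):
    """Proportion of cells in each manual category assigned to each predicted label."""
    if len(manual) != len(predicted):
        raise ValueError("manual and predicted must have the same length")
    counts = defaultdict(Counter)
    for manual_label, predicted_label in zip(manual, predicted):
        counts[manual_label][predicted_label] += 1
    # Divide each row by its total so the row sums to 1.
    return {
        manual_label: {
            predicted_label: count / sum(row.values())
            for predicted_label, count in row.items()
        }
        for manual_label, row in counts.items()
    }


# Hypothetical labels for six cells.
manual = ["phagocyte"] * 4 + ["basal cell"] * 2
predicted = [
    "macrophage", "macrophage", "macrophage", "monocyte",
    "basal cell", "epithelial cell",
]

matrix = row_normalized_confusion(manual, predicted)
print(matrix["phagocyte"])
```

In this toy input, three of the four cells manually labelled as phagocytes are predicted as macrophages, so that row reads 0.75 for macrophage and 0.25 for monocyte.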
