[Preprint]. 2023 Dec 13:2023.04.16.537094.
doi: 10.1101/2023.04.16.537094.

Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis

Wenpin Hou et al. bioRxiv.

Abstract

Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We assessed the performance of GPT-4, a highly potent large language model, for cell type annotation, and demonstrated that it can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines. Evaluated across hundreds of tissue types and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations and has the potential to considerably reduce the effort and expertise needed in cell type annotation. We also developed GPTCelltype, an open-source R software package to facilitate cell type annotation by GPT-4.
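The abstract describes feeding marker gene lists from a standard differential-expression step to GPT-4. A minimal sketch of how such a prompt might be assembled is shown below. The function name `build_annotation_prompt`, the prompt wording, the `top_n` parameter, and the example gene lists are all illustrative assumptions; they are not taken from the paper or from the GPTCelltype package, and no API call is made.

```python
def build_annotation_prompt(tissue, markers_by_cluster, top_n=10):
    """Assemble a single text prompt listing the top marker genes per cluster.

    The wording is a hypothetical template, not the authors' prompt.
    """
    lines = [
        f"Identify cell types of {tissue} cells using the following markers. "
        "Each row is one cell cluster. Only provide the cell type name."
    ]
    for cluster, genes in markers_by_cluster.items():
        # Keep only the first top_n genes, assuming they are already ranked.
        lines.append(f"{cluster}: {', '.join(genes[:top_n])}")
    return "\n".join(lines)


# Illustrative input: cluster labels and gene lists are placeholders.
markers = {
    "cluster0": ["KRT5", "KRT15", "TP63"],
    "cluster1": ["PECAM1", "VWF", "CDH5"],
}
prompt = build_annotation_prompt("human prostate", markers, top_n=2)
print(prompt)
```

The resulting string could then be sent to a language model through whichever client is available; the reply would need to be split back into one answer per cluster.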

Conflict of interest statement

Competing interests: All authors declare no competing interests.

Figures

Figure 1.
a, Diagram comparing cell type annotations by human experts, GPT-4, and other automated methods. b, An example showing GPT-4 prompts and answers for annotating human prostate cells with increasing granularity. c, An example showing GPT-4 prompts and answers for annotating single cell types (first two cell types), mixed cell types (third cell type), and new cell types (fourth cell type). d, Datasets included in this study. Datasets generated before Sep 2021, the cutoff date of GPT-4’s training corpus, are highlighted in blue, and others are highlighted in pink. e, Agreement between original and GPT-4 annotations in identifying cell types of human prostate cells.
Figure 2.
Performance evaluation of GPT-4 in cell type annotation. a, Average agreement scores with different numbers of top differential genes (left), with different statistical methods for differential analysis (middle), and with different prompt strategies (right). b, Proportion of cell types with different levels of agreement in each study and tissue, in the top five most abundant broad cell types and malignant cells, in cell populations with different numbers of cells, and in major cell types or cell subtypes (from top to bottom). Average agreement scores are shown as black dots. c, Log2-transformed ratio of averaged type I collagen gene expression (COL1A1, COL1A2) and type II collagen gene expression (COL2A1). d–e, Average agreement score (d) and running time (e) comparing different methods in each dataset. Each boxplot shows the distribution (center: median; bounds of box: 1st and 3rd quartiles; bounds of whiskers: data points within 1.5 IQR from the box; minima; maxima) of running time. f, The financial cost of querying the GPT-4 API versus the number of cell types in each tissue and study. r represents Pearson correlation. g, Performance of GPT-4 in identifying mixed and single cell types, known and unknown cell types, and with different levels of subsampling and noise. Each dot represents one round of simulation. h, Reproducibility of GPT-4 annotations. Each dot represents one cell type. i, Consistency of agreement scores between the June 13, 2023 and March 23, 2023 versions of GPT-4. The numbers and colors in the plot represent the quantity of cell types from all relevant studies categorized accordingly.
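
The quantity in panel c can be illustrated with a short calculation. The sketch below assumes that "averaged type I collagen gene expression" means the mean of COL1A1 and COL1A2; the caption does not spell this out. The function name `log2_collagen_ratio`, the pseudocount guarding against division by zero, and the expression values are hypothetical and are not taken from the paper.

```python
import math


def log2_collagen_ratio(expr, pseudocount=1e-9):
    """Return log2(mean(COL1A1, COL1A2) / COL2A1) for one cell population.

    `expr` maps gene symbols to expression values. The pseudocount is an
    assumption added here to avoid dividing by zero.
    """
    type1 = (expr["COL1A1"] + expr["COL1A2"]) / 2
    type2 = expr["COL2A1"]
    return math.log2((type1 + pseudocount) / (type2 + pseudocount))


# Made-up expression values for illustration only.
example = {"COL1A1": 6.0, "COL1A2": 10.0, "COL2A1": 2.0}
ratio = log2_collagen_ratio(example)
print(round(ratio, 3))
```

A positive value means type I collagen genes are expressed more highly than COL2A1 in that population, and a negative value means the opposite.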
