Review
Comput Struct Biotechnol J. 2021 Oct 20:19:5874-5887.
doi: 10.1016/j.csbj.2021.10.027. eCollection 2021.

Automatic cell type identification methods for single-cell RNA sequencing

Bingbing Xie et al. Comput Struct Biotechnol J.

Abstract

Single-cell RNA sequencing (scRNA-seq) has become a powerful tool across many research disciplines because it can resolve the heterogeneous and complex cell-type compositions of different tissues and cell populations. Traditional cell-type identification methods for scRNA-seq data rely on manual annotation, which is time-consuming and depends on prior knowledge. By contrast, automatic cell-type identification methods may be fast, accurate, and more user-friendly. Here, we discuss and evaluate thirty-two published automatic methods for scRNA-seq data analysis in terms of their prediction accuracy, F1-score, unlabeling rate and running time. We highlight the advantages and disadvantages of these methods and recommend methods depending on the available information. We further discuss the challenges and future applications of these automatic methods. In addition, we provide a free scRNA-seq data analysis package encompassing the discussed automatic methods to make them easier to use in real-world applications.

Keywords: Automatic identification; Cell type; Eager learning; Lazy learning; Marker learning; Single-cell RNA sequencing (scRNA-seq).
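
The three quality metrics named in the abstract (accuracy, F1-score and unlabeling rate) can be illustrated with a minimal sketch. The definitions below are assumptions, not taken from the paper: accuracy is the fraction of all cells labeled correctly, F1 is macro-averaged over the true cell types, and the unlabeling rate is the fraction of cells a method declines to label. The `UNLABELED` placeholder and the `evaluate` function are hypothetical names.

```python
from collections import Counter

# Hypothetical placeholder for cells that a method declines to label.
UNLABELED = "unassigned"


def evaluate(true_labels, predicted_labels):
    """Return accuracy, macro F1 and unlabeling rate (assumed definitions)."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(true_labels)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1          # the true type was missed
            if pred != UNLABELED:
                fp[pred] += 1       # a wrong type was asserted

    # Per-class F1 = 2TP / (2TP + FP + FN), averaged over true cell types.
    f1_scores = []
    for cell_type in sorted(set(true_labels)):
        denom = 2 * tp[cell_type] + fp[cell_type] + fn[cell_type]
        f1_scores.append(2 * tp[cell_type] / denom if denom else 0.0)

    return {
        "accuracy": sum(tp.values()) / n,
        "macro_f1": sum(f1_scores) / len(f1_scores),
        "unlabeling_rate": sum(p == UNLABELED for p in predicted_labels) / n,
    }


truth = ["T cell", "T cell", "B cell", "B cell", "NK cell"]
pred = ["T cell", "B cell", "B cell", UNLABELED, "NK cell"]
print(evaluate(truth, pred))
```

Under these assumed definitions an unlabeled cell counts against accuracy, so a method can trade a higher unlabeling rate for fewer wrong assignments; other conventions (for example, computing accuracy only over labeled cells) would give different numbers.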

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1
Workflow of the traditional and automatic cell-type identification methods. A. The workflow of the traditional cell-type identification methods, whose input is the testing datasets. An unsupervised method is used to cluster the cells, and the differentially expressed genes of each cluster are detected. The cell type of each cluster is assigned according to the canonical markers among the differentially expressed genes. B. The workflow of the automatic cell-type identification methods. The inputs of the eager learning and lazy learning methods are the training datasets and the testing datasets. The inputs of the marker learning methods are the markers of each cell type and the testing datasets. The training datasets can be downloaded from the data resource centers (GEO, ArrayExpress and GSA). The markers of each cell type can be downloaded from the marker resource centers (PanglaoDB, CellMarker and CancerSEA). Eager learning, lazy learning and marker learning methods rely on classifiers, nearest-neighbor cells and scoring functions, respectively. The cell types assigned by the automatic methods can be given to cells or clusters.
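
Two of the paradigms in the Fig. 1 caption can be sketched in a few lines. This is an illustrative toy, not the algorithm of any of the reviewed tools: the lazy learning function copies the label of the single closest training cell (Euclidean distance is an assumption), and the marker learning function scores each cell type by the mean expression of its markers (the scoring rule and the `min_score` threshold are assumptions). All function names and the tiny example data are hypothetical.

```python
import math


def nearest_neighbor_label(train_profiles, train_labels, test_profile):
    """Lazy learning sketch: copy the label of the closest training cell."""
    best_index = min(
        range(len(train_profiles)),
        key=lambda i: math.dist(train_profiles[i], test_profile),
    )
    return train_labels[best_index]


def marker_score_label(markers_by_type, test_expression, min_score=0.0):
    """Marker learning sketch: score each type by mean marker expression."""
    scores = {}
    for cell_type, markers in markers_by_type.items():
        values = [test_expression.get(gene, 0.0) for gene in markers]
        scores[cell_type] = sum(values) / len(values) if values else 0.0
    best = max(scores, key=scores.get)
    # Leave the cell unlabeled when no cell type scores above the threshold.
    return best if scores[best] > min_score else "unassigned"


# Lazy learning needs labeled training cells (here two genes per cell).
train_profiles = [[5.0, 0.1], [0.2, 4.0]]
train_labels = ["T cell", "B cell"]
print(nearest_neighbor_label(train_profiles, train_labels, [4.5, 0.3]))

# Marker learning needs only a marker list per cell type, no training cells.
markers = {"T cell": ["CD3E", "CD3D"], "B cell": ["CD79A", "MS4A1"]}
cell = {"CD3E": 0.1, "CD79A": 3.2, "MS4A1": 2.8}
print(marker_score_label(markers, cell))
```

An eager learning method would differ from the first function by fitting a classifier on the training datasets once, up front, and then applying it to each testing cell.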
Fig. 2
Performance of the automatic cell-type identification methods using the Tabula Muris datasets. A. Schematic illustration of how the automatic methods are tested for reproducibility and applicability. Eleven mouse tissues (limb muscle, liver, thymus, tongue, bladder, mammary gland, spleen, trachea, lung, marrow and kidney) are used to test self-projection. In applicability, the Smart-seq2 datasets of limb muscle, liver, thymus, and tongue are used as training datasets. The 10x datasets of bladder, mammary gland, spleen, trachea, lung, marrow and kidney are used as training datasets. B. The accuracy, F1-score and unlabeling rate in eleven mouse tissues. The heatmap is ordered by the accuracy in all three types of automatic methods. C. The accuracy, F1-score and unlabeling rate across different platforms. The heatmap is ordered by the accuracy in all three types of automatic methods. The labels ‘5’ and ‘15’ in the marker learning methods and some of the eager learning methods indicate that they use the top 5 or top 15 differentially expressed markers.
Fig. 3
Performance of the automatic cell-type identification methods using PBMC and tumor datasets. A. Circos plot showing the accuracy, F1-score and unlabeling rate for the PBMC datasets. The methods are ordered by the accuracy in all three types of automatic methods. “Cano” in the marker learning methods and some of the eager learning methods: canonical markers. B. The performance of the automatic methods using human normal lung data to predict Tabula Muris lung data. As ACTINN, MARS, SciBet, scVI, Seurat and SingleR did not predict unlabeled cells, they are not included in the calculation of sensitivity and specificity of tumor cells. scClassifR and SingleCellNet are not included since they did not predict any unlabeled cells.
Fig. 4
Speed of automatic cell-type identification methods. A. Speed of the automatic methods. A fixed size of the testing dataset and varying sizes of the training datasets are used to test the computation time using different training datasets. Also, a fixed size of the training dataset and varying sizes of testing datasets are used to test the computation time using different testing datasets. B. The computation time of the automatic methods with the training dataset set at 500, 1000, 2500, 5000 and 10,000 cells, and the testing dataset set at 5000 cells. The marker learning methods are not included since they do not require training datasets. C. The computation time of the automatic methods with the training dataset set at 700 cells, and the testing dataset set at 1000, 2000, 5000, 10,000, 20,000 and 50,000 cells.
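
The two timing designs in the Fig. 4 caption can be sketched as a small harness. The dataset sizes come from the caption; everything else is an assumption made for illustration, including the hypothetical `fit`, `predict` and `make_cells` callables, which stand in for a real method and a real subsampling step.

```python
import time

TRAIN_SIZES = [500, 1000, 2500, 5000, 10000]           # panel B, test fixed at 5000
TEST_SIZES = [1000, 2000, 5000, 10000, 20000, 50000]   # panel C, train fixed at 700
FIXED_TEST_SIZE = 5000
FIXED_TRAIN_SIZE = 700


def time_run(fit, predict, train_cells, test_cells):
    """Wall-clock seconds for one train-then-predict run."""
    start = time.perf_counter()
    model = fit(train_cells)
    predict(model, test_cells)
    return time.perf_counter() - start


def benchmark(fit, predict, make_cells):
    """Time a method under both designs; make_cells(n) returns n cells."""
    vary_train = {
        n: time_run(fit, predict, make_cells(n), make_cells(FIXED_TEST_SIZE))
        for n in TRAIN_SIZES
    }
    vary_test = {
        n: time_run(fit, predict, make_cells(FIXED_TRAIN_SIZE), make_cells(n))
        for n in TEST_SIZES
    }
    return vary_train, vary_test
```

A marker learning method has no training step, which is why the caption excludes those methods from the varying-training-size panel.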
Fig. 5
Summary of the performance of the automatic cell-type identification methods. Bar graphs of the automatic cell-type identification methods with six evaluation criteria indicated. For each evaluation criterion, the length of the bar shows the performance of the automatic method: poor, median or good. The automatic methods are sorted by their mean performance across the evaluation criteria. No bar: not evaluated.

