Brief Bioinform. 2023 Sep 20;24(5):bbad266.
doi: 10.1093/bib/bbad266.

scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

Shangru Jia et al. Brief Bioinform. 2023.

Abstract

Annotation of cell-types is a critical step in the analysis of single-cell RNA sequencing (scRNA-seq) data that allows the study of heterogeneity across multiple cell populations. Currently, this is most commonly done using unsupervised clustering algorithms, which project single-cell expression data into a lower dimensional space and then cluster cells based on their distances from each other. However, as these methods do not use reference datasets, they can only achieve a rough classification of cell-types, and it is difficult to improve the recognition accuracy further. To address this issue, we propose a novel supervised annotation method, scDeepInsight. The scDeepInsight method is capable of performing manifold assignments. It can integrate data through batch normalization, perform supervised training on the reference dataset, detect outliers and annotate cell-types on query datasets. Moreover, it can help identify active genes or marker genes related to cell-types. The scDeepInsight model is trained in a distinctive way. Tabular scRNA-seq data are first converted to corresponding images through the DeepInsight methodology. DeepInsight can create a trainable image transformer to convert non-image RNA data to images by comprehensively comparing interrelationships among multiple genes. Subsequently, the converted images are fed into convolutional neural networks such as EfficientNet-b3. This enables automatic feature extraction to identify the cell-types of scRNA-seq samples. We benchmarked scDeepInsight against six other mainstream cell annotation methods. The average accuracy of scDeepInsight reached 87.5%, more than 7% higher than that of the state-of-the-art methods.

Keywords: cell annotation; deep learning; single-cell RNA sequencing; transformers.
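
The tabular-to-image step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each gene already has a 2D coordinate from some embedding of the genes, and the function names `genes_to_pixels` and `cell_to_image`, the grid size, and the choice to average genes that land on the same pixel are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def genes_to_pixels(
    coords: Sequence[Tuple[float, float]], size: int
) -> List[Tuple[int, int]]:
    """Scale 2D gene coordinates onto a size x size pixel grid (row, col)."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    min_x, min_y = min(xs), min(ys)
    # Guard against a zero span when all genes share a coordinate.
    span_x = (max(xs) - min_x) or 1.0
    span_y = (max(ys) - min_y) or 1.0
    pixels = []
    for x, y in coords:
        col = min(int((x - min_x) / span_x * size), size - 1)
        row = min(int((y - min_y) / span_y * size), size - 1)
        pixels.append((row, col))
    return pixels


def cell_to_image(
    expression: Sequence[float], pixels: Sequence[Tuple[int, int]], size: int
) -> List[List[float]]:
    """Build one image for one cell; genes sharing a pixel are averaged."""
    sums = [[0.0] * size for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for value, (row, col) in zip(expression, pixels):
        sums[row][col] += value
        counts[row][col] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else 0.0 for c in range(size)]
        for r in range(size)
    ]


# Hypothetical coordinates for four genes and one cell's expression values.
gene_coords = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.05), (1.0, 0.0)]
pixels = genes_to_pixels(gene_coords, size=2)
image = cell_to_image([2.0, 4.0, 6.0, 1.0], pixels, size=2)
print(image)  # [[4.0, 1.0], [0.0, 4.0]]
```

The pixel positions are computed once from the gene layout and then reused for every cell, so each cell yields an image of the same shape that a convolutional network can consume.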

Figures

Figure 1
The scDeepInsight pipeline: the key steps performed by scDeepInsight, from the input single-cell unique molecular identifier (UMI) count matrix to the output cell annotation predictions. A reference dataset and a query dataset are processed via quality control, normalization and correction of batch effects. Then, the processed tabular data are converted into 2D embeddings. After framing and feature mapping, single-cell expression data are transformed into corresponding images. The reference dataset is then used to train the CNN model; the query dataset is not used in training. Once the CNN model is trained, it is used to cluster single-cell samples from the query dataset into cell-types. For subsequent query datasets, no further training on the reference dataset is needed, so the previously trained model can be used directly for clustering and annotation.
Figure 2
Preprocessing results of the test dataset Schulte–Schrepping. (A) The left plot is the quality control plot of the test dataset. Cells whose nFeature_RNA was less than 300 or more than 4000 were filtered out. Cells with percent.mt larger than 15 were also excluded. The right plot is labeled by sequencing technology: samples in the reference dataset were sequenced by 10× Chromium 3′ v3, and samples in the query dataset were sequenced by 10× Chromium 3′ v2 and 10× Chromium 3′ v3. (B) The Uniform Manifold Approximation and Projection (UMAP) representation of the reference before batch effect correction labeled by cell-types and data sources. (C) The UMAP representation of datasets after batch effect correction.
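
The quality control thresholds quoted in the Figure 2 caption can be expressed as a simple filter. This is a minimal sketch, assuming per-cell metadata is available as plain dictionaries keyed by the caption's field names `nFeature_RNA` and `percent.mt`; the constant names, the `passes_qc` and `filter_cells` helpers, and the example cells are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List, Union

Cell = Dict[str, Union[str, float]]

# Thresholds as stated in the Figure 2 caption.
MIN_FEATURES = 300      # cells with fewer detected genes are removed
MAX_FEATURES = 4000     # cells with more detected genes are removed
MAX_PERCENT_MT = 15.0   # cells with a larger mitochondrial percentage are removed


def passes_qc(cell: Cell) -> bool:
    """Return True if the cell survives the caption's QC thresholds."""
    return (
        MIN_FEATURES <= cell["nFeature_RNA"] <= MAX_FEATURES
        and cell["percent.mt"] <= MAX_PERCENT_MT
    )


def filter_cells(cells: List[Cell]) -> List[Cell]:
    """Keep only the cells that pass quality control."""
    return [cell for cell in cells if passes_qc(cell)]


# Made-up cells for illustration only.
cells = [
    {"barcode": "cell_a", "nFeature_RNA": 1200, "percent.mt": 4.2},
    {"barcode": "cell_b", "nFeature_RNA": 250, "percent.mt": 3.0},
    {"barcode": "cell_c", "nFeature_RNA": 4500, "percent.mt": 2.0},
    {"barcode": "cell_d", "nFeature_RNA": 2000, "percent.mt": 18.5},
    {"barcode": "cell_e", "nFeature_RNA": 300, "percent.mt": 15.0},
]
kept = filter_cells(cells)
print([cell["barcode"] for cell in kept])  # ['cell_a', 'cell_e']
```

Values exactly at a threshold are kept here, because the caption only excludes cells strictly below 300, strictly above 4000, or with percent.mt strictly larger than 15.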
Figure 3
The performance of scDeepInsight. (A) Accuracy and ARI of scDeepInsight compared to other methods: SC3, FindClusters, SCINA, SingleR, CellTypist and scBERT, across the six datasets: Yaza, Schulte–Schrepping, Arunachalam, Lee, 10×-Multiome-Pbmc10k and Wilk. (B) The accuracy and ARI box plots of scDeepInsight and the other six methods used in benchmarking are depicted.
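
The two metrics reported in Figure 3, accuracy and the adjusted Rand index (ARI), can be computed from a pair of label vectors as sketched below. This uses the standard pair-counting definition of ARI; whether the authors used exactly this formulation or a library routine is not stated here, so treat it as an assumption. The function names and the example labels are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def accuracy(true_labels: Sequence[Hashable], pred_labels: Sequence[Hashable]) -> float:
    """Fraction of cells whose predicted label matches the reference label."""
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_labels, pred_labels))
    return matches / len(true_labels)


def adjusted_rand_index(
    true_labels: Sequence[Hashable], pred_labels: Sequence[Hashable]
) -> float:
    """Adjusted Rand index from the contingency table of two labelings."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the partitions trivially agree
    # Pairs of cells grouped together in both labelings, and in each one alone.
    sum_pairs = sum(comb(n, 2) for n in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(n, 2) for n in Counter(true_labels).values())
    sum_pred = sum(comb(n, 2) for n in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_pairs - expected) / (max_index - expected)


# Made-up labels for four cells.
reference = ["B", "B", "T", "T"]
predicted = ["B", "B", "T", "NK"]
print(accuracy(reference, predicted))                       # 0.75
print(round(adjusted_rand_index(reference, predicted), 3))  # 0.571
```

Accuracy requires the predicted label names to match the reference names, whereas ARI only compares how cells are grouped, which is why it can also be reported for unsupervised methods such as SC3 and FindClusters.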
Figure 4
Cell-type labels of the reference dataset and prediction results on dataset Schulte–Schrepping. (A) The stacked percentage column chart of the prediction results on the Schulte–Schrepping dataset. (B) UMAP representation colored by cell-types in the original study. (C) Heatmap of the confusion matrix. (D) UMAP representation colored by cell-types predicted by scDeepInsight. (E) UMAP representation colored by annotation results using SC3 clustering with ScType.
Figure 5
(A) CD14 [29] and CDKN1C [30] were shown in previous studies to be marker genes for monocytes: CD14 for CD14 Monocytes and CDKN1C for CD16 Monocytes. (B) The UMAP 2D embedding after performing normalization and principal component analysis (PCA) dimensionality reduction on the reference dataset. Cells are grouped and colored by known labels from previous studies.

References

    1. Clarke ZA, Andrews TS, Atif J, et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nat Protoc 2021;16:2749–64.
    2. Yang F, Wang W, Wang F, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4:852–66.
    3. Hicks SC, Liu R, Ni Y, et al. Mbkmeans: fast clustering for single cell data using mini-batch k-means. PLoS Comput Biol 2021;17:e1008625.
    4. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017;14:483–6.
    5. Waltman L, van Eck NJ. A smart local moving algorithm for large-scale modularity-based community detection. Eur Phys J B 2013;86:471.
