scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

Shangru Jia¹, Artem Lysenko^{2,3}, Keith A Boroevich³, Alok Sharma^{2,3,4}, Tatsuhiko Tsunoda^{1,2,3}

Affiliations

¹ Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Japan.
² Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Japan.
³ Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Japan.
⁴ Institute for Integrated and Intelligent Systems, Griffith University, Australia.

PMID: 37523217
PMCID: PMC10516353
DOI: 10.1093/bib/bbad266

Shangru Jia et al. Brief Bioinform. 2023 Sep 20;24(5):bbad266. doi: 10.1093/bib/bbad266.

Abstract

Annotation of cell-types is a critical step in the analysis of single-cell RNA sequencing (scRNA-seq) data that allows the study of heterogeneity across multiple cell populations. Currently, this is most commonly done using unsupervised clustering algorithms, which project single-cell expression data into a lower-dimensional space and then cluster cells based on their distances from each other. However, because these methods do not use reference datasets, they can achieve only a rough classification of cell-types, and it is difficult to improve their recognition accuracy further. To address this issue, we propose a novel supervised annotation method, scDeepInsight. The scDeepInsight method is capable of performing manifold assignments. It can integrate data through batch normalization, perform supervised training on a reference dataset, detect outliers and annotate cell-types in query datasets. Moreover, it can help identify active genes or marker genes related to cell-types. The scDeepInsight model is trained in a distinctive way. Tabular scRNA-seq data are first converted to corresponding images through the DeepInsight methodology. DeepInsight can create a trainable image transformer that converts non-image RNA data to images by comprehensively comparing interrelationships among multiple genes. Subsequently, the converted images are fed into convolutional neural networks (CNNs) such as EfficientNet-b3. This enables automatic feature extraction to identify the cell-types of scRNA-seq samples. We benchmarked scDeepInsight against six other mainstream cell annotation methods. The average accuracy rate of scDeepInsight reached 87.5%, which is more than 7% higher than that of the state-of-the-art methods.
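
The image-conversion step described above can be illustrated with a minimal sketch. It assumes that every gene has already been assigned a 2D coordinate (the coordinates below are made up) and shows only the pixel-mapping idea: coordinates are scaled onto a pixel grid, and genes that land on the same pixel are averaged, which is a simplifying assumption of this sketch. The names `genes_to_pixels` and `cell_to_image` are hypothetical and do not belong to the scDeepInsight or DeepInsight packages; other parts of the published method are omitted.

```python
from collections import defaultdict


def genes_to_pixels(coords, size):
    """Scale 2D gene coordinates to (row, col) positions on a size x size grid."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0  # avoid division by zero for degenerate input
    span_y = (max(ys) - min_y) or 1.0
    pixels = []
    for x, y in coords:
        col = round((x - min_x) / span_x * (size - 1))
        row = round((y - min_y) / span_y * (size - 1))
        pixels.append((row, col))
    return pixels


def cell_to_image(expression, pixels, size):
    """Place one cell's expression values on the grid.

    Genes that share a pixel are averaged (an assumption of this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, pixel in zip(expression, pixels):
        sums[pixel] += value
        counts[pixel] += 1
    image = [[0.0] * size for _ in range(size)]
    for (row, col), total in sums.items():
        image[row][col] = total / counts[(row, col)]
    return image


# Made-up coordinates for four genes; the last two are close and share a pixel.
gene_coords = [(0.0, 0.0), (1.0, 1.0), (0.52, 0.5), (0.5, 0.5)]
pixels = genes_to_pixels(gene_coords, size=3)

# One cell's (made-up) expression values, in the same gene order.
image = cell_to_image([2.0, 4.0, 1.0, 3.0], pixels, size=3)
```

Because the gene-to-pixel layout is computed once and reused for every cell, each cell becomes an image of the same shape, which is what allows a CNN to be trained on the reference dataset and then applied to query cells.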

Keywords: cell annotation; deep learning; single-cell RNA sequencing; transformers.

Figures

**Figure 1**
The scDeepInsight pipeline: the key steps performed by scDeepInsight, from inputting a single-cell unique molecular identifier (UMI) count matrix to outputting cell annotation predictions. A reference dataset and a query dataset are processed together via quality control, normalization and correction of batch effects. Then, the processed tabular data are converted into 2D embeddings. After framing and feature mapping, the single-cell expression data are transformed into corresponding images. The reference dataset is then used to train the CNN model; the query dataset is not used in the training step. Once the CNN model is trained, it is used to cluster single-cell samples from the query dataset into cell-types. For subsequent query datasets, no further training on the reference dataset is performed, so the previously trained model can be used directly for clustering and annotation.

**Figure 2**
Preprocessing results of the test dataset Schulte–Schrepping. (A) The left plot is the quality control plot of the test dataset. Cells whose nFeature_RNA was less than 300 or more than 4000 were filtered out. Cells with percent.mt larger than 15 were also excluded. The right plot is labeled by the sequencing technology of data: samples in the reference dataset sequenced by 10× Chromium 3′ v3, samples in the query dataset sequenced by 10× Chromium 3′ v2 and 10× Chromium 3′ v3. (B) The Uniform Manifold Approximation and Projection (UMAP) representation of the reference before batch effect correction labeled by cell-types and data sources. (C) The UMAP representation of datasets after batch effect correction.
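
The quality-control thresholds stated in panel (A) can be written as a simple filter. This is a minimal sketch only: the field names `nFeature_RNA` and `percent.mt` come from the caption, while the function `passes_qc`, the dictionary layout and the example cells are hypothetical and are not taken from the authors' code.

```python
def passes_qc(cell, min_features=300, max_features=4000, max_percent_mt=15.0):
    """Return True if a cell is kept under the Figure 2 thresholds.

    A cell is removed when nFeature_RNA is below 300 or above 4000,
    or when percent.mt is larger than 15.
    """
    n_features = cell["nFeature_RNA"]
    return (
        min_features <= n_features <= max_features
        and cell["percent.mt"] <= max_percent_mt
    )


# Made-up cells for illustration only.
cells = [
    {"id": "cell_a", "nFeature_RNA": 1200, "percent.mt": 4.2},   # kept
    {"id": "cell_b", "nFeature_RNA": 250, "percent.mt": 3.0},    # too few features
    {"id": "cell_c", "nFeature_RNA": 5200, "percent.mt": 2.5},   # too many features
    {"id": "cell_d", "nFeature_RNA": 2000, "percent.mt": 22.0},  # high mitochondrial share
]

kept_ids = [cell["id"] for cell in cells if passes_qc(cell)]
```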

**Figure 3**
The performance of scDeepInsight. (A) Accuracy and adjusted Rand index (ARI) of scDeepInsight compared with the other methods (SC3, FindClusters, SCINA, SingleR, CellTypist and scBERT) across the six datasets: Yaza, Schulte–Schrepping, Arunachalam, Lee, 10×-Multiome-Pbmc10k and Wilk. (B) Box plots of the accuracy and ARI of scDeepInsight and the other six methods used in benchmarking.
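
For readers unfamiliar with the two metrics in this figure, the following sketch computes accuracy and the ARI from paired label lists using the standard pair-counting formula. It is an illustration with made-up labels, not the authors' evaluation code, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted label equals the reference label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def adjusted_rand_index(true_labels, predicted_labels):
    """ARI from the contingency table of two labelings (pair-counting form)."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement here
    pair_counts = Counter(zip(true_labels, predicted_labels))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(predicted_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single group
    return (sum_pairs - expected) / (max_index - expected)


# Made-up labels for six cells; one NK cell is predicted as a T cell.
true_labels = ["B", "B", "T", "T", "NK", "NK"]
predicted = ["B", "B", "T", "T", "T", "NK"]

acc = accuracy(true_labels, predicted)
ari = adjusted_rand_index(true_labels, predicted)
```

Accuracy requires predicted label names to match the reference names, whereas the ARI depends only on how cells are grouped, so it can also be applied to unsupervised clustering output whose cluster names are arbitrary.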

**Figure 4**
Cell-type labels of the reference dataset and prediction results on dataset Schulte–Schrepping. (A) The stacked percentage column chart of the prediction results on the Schulte–Schrepping dataset. (B) UMAP representation colored by cell-types in the original study. (C) Heatmap of the confusion matrix. (D) UMAP representation colored by cell-types predicted by scDeepInsight. (E) UMAP representation colored by annotation results using SC3 clustering with ScType.

**Figure 5**
(A) CD14 [29] and CDKN1C [30] were shown in previous studies to be marker genes for monocytes, specifically CD14 Monocytes and CD16 Monocytes, respectively. (B) The UMAP 2D embedding after performing normalization and principal component analysis (PCA) dimensionality reduction on the reference dataset. Cells are grouped and colored by known labels from previous studies.

References

1. Clarke ZA, Andrews TS, Atif J, et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nat Protoc 2021;16:2749–64.
2. Yang F, Wang W, Wang F, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4:852–66.
3. Hicks SC, Liu R, Ni Y, et al. Mbkmeans: fast clustering for single cell data using mini-batch k-means. PLoS Comput Biol 2021;17:e1008625.
4. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017;14:483–6.
5. Waltman L, van Eck NJ. A smart local moving algorithm for large-scale modularity-based community detection. Eur Phys J B 2013;86:471.
