2023 May 26;24(1):217.
doi: 10.1186/s12859-023-05339-4.

scSemiAAE: a semi-supervised clustering model for single-cell RNA-seq data

Zile Wang et al. BMC Bioinformatics. 2023.

Abstract

Background: Single-cell RNA sequencing (scRNA-seq) aims to capture cellular diversity at higher resolution than bulk RNA sequencing. Clustering analysis is critical to transcriptome research because it enables the identification of known cell types and the discovery of new ones. Unsupervised clustering cannot integrate prior knowledge, even when relevant information is widely available. Faced with the high dimensionality of scRNA-seq data and frequent dropout events, purely unsupervised algorithms may not yield biologically interpretable clusters, which makes cell type identification more challenging.

Results: We propose scSemiAAE, a semi-supervised clustering model for scRNA-seq analysis based on deep generative neural networks. Specifically, scSemiAAE uses a zero-inflated negative binomial (ZINB) adversarial autoencoder architecture that integrates adversarial training and semi-supervised modules in the latent space. In a series of experiments on scRNA-seq datasets spanning thousands to tens of thousands of cells, scSemiAAE significantly improves clustering performance compared with a range of unsupervised and semi-supervised algorithms, supporting clustering and the interpretability of downstream analyses.
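As a rough illustration of the ZINB component named above, the following sketch evaluates a zero-inflated negative binomial log-likelihood in plain Python. The parameterization (mean mu, dispersion theta, dropout probability pi) and all function names are assumptions chosen for illustration; this is not taken from the scSemiAAE code base, where these parameters would typically be produced per gene by the decoder.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.
    pi (0 <= pi < 1) is the probability of an extra 'dropout' zero."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from dropout or from the negative binomial itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    # A non-zero count can only come from the negative binomial part.
    return math.log1p(-pi) + nb

def zinb_nll(counts, mu, theta, pi):
    """Mean negative log-likelihood of a list of counts
    under shared (hypothetical) parameters."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts) / len(counts)

if __name__ == "__main__":
    # Toy counts with many zeros, as is typical for scRNA-seq data.
    toy = [0, 0, 0, 1, 0, 3, 0, 7]
    print(zinb_nll(toy, mu=2.0, theta=1.5, pi=0.3))
```

Setting pi to 0 reduces the model to an ordinary negative binomial, which is one way to see what the zero-inflation term contributes for dropout-heavy data.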

Conclusion: scSemiAAE is a Python-based algorithm implemented on the VSCode platform that provides efficient visualization, clustering, and cell type assignment for scRNA-seq data. The tool is available from https://github.com/WHang98/scSemiAAE.

Keywords: Adversarial autoencoder; Clustering; Deep learning; Semi-supervised; scRNA-seq.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Illustration of the scSemiAAE model. A The scRNA-seq count matrix X is preprocessed through gene filtering, selection of highly variable genes, and normalization. It is then divided into m_ and m depending on whether it contains true labels. B The encoder receives m_ and m and generates the corresponding latent variables z_ and z, respectively. C The SoftMax layer transforms the latent vector z_ into the pseudo-label c, which is then combined with the partial true label y_ to form a cross-entropy loss. D The decoder reconstructs the data from the latent representation z under a zero-inflated negative binomial loss constraint. E Simultaneously, the latent feature z is fed to the discriminator for adversarial training, which yields the discriminator loss. F After training is complete, all the latent z and labels c are concatenated, and the final clustering results are given by a Gaussian mixture model.
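Panel C of the caption above can be sketched as a softmax followed by a cross-entropy against the known labels. The sketch below is a minimal illustration that assumes the latent vector is used directly as the logits and that labels are integer class indices; the function names are hypothetical and the actual model may insert additional layers before the SoftMax.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(latents, labels):
    """Mean cross-entropy between softmax(latent) and integer labels.
    latents: latent vectors of the labeled cells (z_ in the caption),
             treated here as logits (an assumption for illustration).
    labels:  true class indices for those cells (y_ in the caption)."""
    loss = 0.0
    for z, y in zip(latents, labels):
        pseudo_label = softmax(z)  # pseudo-label c in the caption
        loss -= math.log(pseudo_label[y])
    return loss / len(labels)

if __name__ == "__main__":
    # Two labeled cells, three candidate cell types (toy values).
    z_labeled = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0]]
    y_labeled = [0, 2]
    print(cross_entropy(z_labeled, y_labeled))
```

Only the labeled cells contribute to this term; the unlabeled cells would influence the latent space through the reconstruction and adversarial losses described in panels D and E.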
Fig. 2
Latent representation visualization. The images are based on embedded representations of the 10X_PBMC, Human kidney cells, Worm neuron cells, Human liver and CITE_PBMC datasets. Each dot represents a cell, and the colors of the dots indicate the predicted labels.
Fig. 3
Benchmarking results on real scRNA-seq datasets. Clustering performance of scDeepCluster, scDSC, scDEC, scDHA, SC3, scGAE, scDCC, scAL, Itclust, scSemiAE and scSemiAAE, measured by ACC, NMI and ARI. The first six are unsupervised methods, and the remaining ones are semi-supervised clustering algorithms. A Comparison with semi-supervised clustering approaches on three datasets with the top 2000 highly scattered genes. B The results of unsupervised clustering algorithms. C scSemiAAE with different proportions of labels on seven real datasets, measured by NMI.
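The three metrics named in the caption above (ACC, NMI, ARI) can be computed from predicted and true labels as sketched below. This is a stdlib-only illustration, not the authors' evaluation code: the function names are hypothetical, NMI is normalized by the arithmetic mean of the two entropies (one common convention among several), and ACC searches cluster-to-label mappings by brute force, which is only practical for a small number of clusters.

```python
import math
from collections import Counter
from itertools import permutations

def adjusted_rand_index(true, pred):
    """ARI from pair counts; assumes at least two samples."""
    n = len(true)
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(true).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_info(true, pred):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    n = len(true)
    ct, cp = Counter(true), Counter(pred)
    mi = sum(
        (c / n) * math.log(c * n / (ct[t] * cp[p]))
        for (t, p), c in Counter(zip(true, pred)).items()
    )
    denom = (_entropy(true) + _entropy(pred)) / 2
    return mi / denom if denom > 0 else 1.0

def clustering_accuracy(true, pred):
    """ACC under the best one-to-one mapping of clusters to labels.
    Brute force over permutations: for small cluster counts only."""
    true_ids = list(set(true))
    pred_ids = list(set(pred))
    # Pad so every predicted cluster can be assigned (possibly to nothing).
    true_ids += [None] * max(0, len(pred_ids) - len(true_ids))
    pairs = Counter(zip(true, pred))
    best = 0
    for perm in permutations(true_ids, len(pred_ids)):
        matched = sum(pairs[(t, p)] for t, p in zip(perm, pred_ids))
        best = max(best, matched)
    return best / len(true)

if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [1, 1, 0, 0, 2, 2]  # same partition, different cluster names
    print(clustering_accuracy(y_true, y_pred),
          normalized_mutual_info(y_true, y_pred),
          adjusted_rand_index(y_true, y_pred))
```

All three scores ignore the arbitrary naming of clusters, which is why the toy prediction in the usage example, a relabeling of the true partition, counts as a perfect match.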
Fig. 4
Model performance analysis of scSemiAAE. A Comparison of the scalability of different algorithms on the real datasets by ARI and NMI metrics. B Clustering performance on large-scale datasets. C Differential expression analysis based on Baron (human) data.
