Bioinformatics. 2024 May 2;40(5):btae293.
doi: 10.1093/bioinformatics/btae293.

scTPC: a novel semisupervised deep clustering model for scRNA-seq data

Yushan Qiu et al. Bioinformatics.

Abstract

Motivation: Continuous advances in single-cell RNA sequencing (scRNA-seq) technology have enabled researchers to study cell heterogeneity, trajectory inference, rare cell-type identification, and neurology in greater depth. Accurate clustering of scRNA-seq data is crucial to single-cell sequencing analysis. However, the high dimensionality, sparsity, and "false" zero values in the data make clustering difficult. Furthermore, current unsupervised clustering algorithms do not effectively leverage prior biological knowledge, which makes cell clustering even more challenging.

Results: This study presents scTPC, a semisupervised deep clustering model that integrates a triplet constraint, a pairwise constraint, and a cross-entropy constraint. Specifically, the model begins by pretraining a denoising autoencoder based on a zero-inflated negative binomial (ZINB) distribution. Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from a partially labeled set of cells. Finally, to address datasets with imbalanced cell types, a weighted cross-entropy loss is introduced to optimize the model. Experiments on 10 real scRNA-seq datasets and five simulated datasets demonstrate that scTPC achieves accurate clustering.
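The three semisupervised terms named above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the released code is the reference. The function names, the squared Euclidean distance, the margin value, the dot-product form of the must-link and cannot-link penalties, and the inverse-frequency class weights are all assumptions chosen for illustration.

```python
import numpy as np


def sample_triplets(labels, n_triplets, rng):
    """Sample (anchor, positive, negative) index triplets from labeled cells.

    Hypothetical helper: positive shares the anchor's label, negative does not.
    """
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    classes, counts = np.unique(labels, return_counts=True)
    # Anchors must belong to a class with at least one other member.
    valid = idx[np.isin(labels, classes[counts >= 2])]
    if len(classes) < 2 or len(valid) == 0:
        raise ValueError("need >= 2 classes and one class with >= 2 cells")
    triplets = []
    for a in rng.choice(valid, size=n_triplets):
        same = idx[(labels == labels[a]) & (idx != a)]
        diff = idx[labels != labels[a]]
        triplets.append((a, rng.choice(same), rng.choice(diff)))
    return np.array(triplets)


def triplet_loss(z, triplets, margin=1.0):
    """Hinge loss pulling anchor-positive closer than anchor-negative."""
    a, p, n = z[triplets[:, 0]], z[triplets[:, 1]], z[triplets[:, 2]]
    d_ap = np.sum((a - p) ** 2, axis=1)
    d_an = np.sum((a - n) ** 2, axis=1)
    return np.mean(np.maximum(0.0, d_ap - d_an + margin))


def pairwise_loss(q, must_link, cannot_link, eps=1e-10):
    """Must-link / cannot-link penalty on soft cluster assignments q."""
    ml_sim = np.sum(q[must_link[:, 0]] * q[must_link[:, 1]], axis=1)
    cl_sim = np.sum(q[cannot_link[:, 0]] * q[cannot_link[:, 1]], axis=1)
    ml = -np.log(ml_sim + eps)        # small when linked cells agree
    cl = -np.log(1.0 - cl_sim + eps)  # small when separated cells disagree
    return ml.mean() + cl.mean()


def weighted_cross_entropy(q, labels, n_classes, eps=1e-10):
    """Cross-entropy with inverse-frequency class weights (labels are 0..K-1)."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=n_classes)
    weights = len(labels) / (n_classes * np.maximum(counts, 1))
    nll = -np.log(q[np.arange(len(labels)), labels] + eps)
    return np.mean(weights[labels] * nll)
```

In this sketch each triplet also yields one must-link pair (anchor, positive) and one cannot-link pair (anchor, negative), so `triplets[:, :2]` and `triplets[:, [0, 2]]` can be passed directly to `pairwise_loss`. The inverse-frequency weights make an error on a rare cell type cost more than the same error on a common one.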

Availability and implementation: scTPC is a Python-based algorithm, and the code is available from https://github.com/LF-Yang/Code or https://zenodo.org/records/10951780.


Conflict of interest statement

The authors declared that they have no conflicts of interest.

Figures

Figure 1.
Overview of scTPC. A deep denoising autoencoder based on ZINB is constructed, with symmetrical structures for both the encoder and the decoder. The scRNA-seq data are used as input, and the outputs are three sets of parameters: dropout rate, mean value, and dispersion value. Deep clustering is performed on the embedded points in the latent space. The label information is integrated to generate triplet and pairwise constraints, and weighted cross-entropy is introduced to balance the dataset.
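The three decoder outputs in this caption (dropout rate, mean, and dispersion) are the parameters of a ZINB likelihood. A minimal sketch of the standard ZINB negative log-likelihood follows. The symbol names `pi`, `mu`, and `theta`, the function names, and the `eps` stabilizer are assumptions for illustration, and the sketch omits any scaling or regularization the authors may apply.

```python
import math

import numpy as np

# Elementwise log-gamma using only the standard library.
_lgamma = np.vectorize(math.lgamma)


def nb_log_prob(x, mu, theta, eps=1e-10):
    """Log-probability of counts x under NB with mean mu, dispersion theta."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    theta = np.asarray(theta, dtype=float)
    log_norm = _lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
    log_denominator = np.log(theta + mu + eps)
    return (log_norm
            + theta * (np.log(theta + eps) - log_denominator)
            + x * (np.log(mu + eps) - log_denominator))


def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Mean negative log-likelihood of x under a ZINB model.

    pi is the dropout (zero-inflation) probability: a zero count can come
    either from the dropout component or from the NB component itself.
    """
    x = np.asarray(x, dtype=float)
    log_nb = nb_log_prob(x, mu, theta, eps)
    zero_case = np.log(pi + (1.0 - pi) * np.exp(log_nb) + eps)
    nonzero_case = np.log(1.0 - pi + eps) + log_nb
    return -np.mean(np.where(x == 0, zero_case, nonzero_case))
```

Setting `pi` to 0 reduces the model to a plain negative binomial, while a larger `pi` makes observed zeros more likely, which is how the model accounts for "false" zeros.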
Figure 2.
Comparison of NMI values between our model and deep learning and semisupervised clustering methods on the five simulated datasets.
Figure 3.
Clustering performance (a) NMI and (b) boxplot of different algorithms on 10 scRNA-seq datasets.
Figure 4.
Visualizing clustering results on scRNA-seq datasets using UMAP. (a)–(d) are the visualization results of applying UMAP directly to datasets “Mouse_ES_cell,” “10X_PBMC,” “Wang_Lung,” and “Chen,” respectively. In contrast, (e)–(h) display the visualization results of the corresponding datasets after undergoing scTPC preprocessing.
Figure 5.
Sankey plots on the datasets. (a) “Worm_neuron_cell,” (b) “Qx_Spleen,” (c) “Mouse_ES_cell”.
Figure 6.
Clustering performance for the datasets with different proportions of labeled cells. (a) “Mouse_bladder_cell,” (b) “Wang_Lung,” (c) “10X_PBMC,” (d) “Splat3,” (e) “Splat4,” (f) “Splat5”.
Figure 7.
Determination of the number of triplets. (a) “Mouse_ES_cell,” (b) “Worm_neuron_cell,” (c) “Young”.
Figure 8.
The impact of the introduced semisupervised constraints on the model. (a) “10X_PBMC,” (b) “Qx_Spleen,” and (c) “Qx_Trachea”.
