Bioinformatics. 2023 Jul 1;39(7):btad435.

doi: 10.1093/bioinformatics/btad435.

Cytocipher determines significantly different populations of cells in single-cell RNA-seq data

Brad Balderson¹, Michael Piper², Stefan Thor², Mikael Bodén¹

Affiliations

¹ School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD 4072, Australia.
² School of Biomedical Sciences, University of Queensland, Brisbane, QLD 4072, Australia.

PMID: 37449901
PMCID: PMC10368802
DOI: 10.1093/bioinformatics/btad435

Brad Balderson et al. Bioinformatics. 2023.


Abstract

Motivation: Identification of cell types using single-cell RNA-seq is revolutionizing the study of multicellular organisms. However, typical single-cell RNA-seq analysis often involves post hoc manual curation to ensure clusters are transcriptionally distinct, which is time-consuming, error-prone, and irreproducible.

Results: To overcome these obstacles, we developed Cytocipher, a bioinformatics method and scverse-compatible software package that statistically determines significant clusters. Application of Cytocipher to normal tissue, development, disease, and large-scale atlas data reveals the broad applicability and power of Cytocipher to generate biological insights in numerous contexts. This included the identification of cell types not previously described in the datasets analysed, such as CD8+ T cell subtypes in human peripheral blood mononuclear cells; cell lineage intermediate states during mouse pancreas development; and subpopulations of luminal epithelial cells over-represented in prostate cancer. Cytocipher also scales to large datasets with high test performance, as shown by application to the Tabula Sapiens Atlas representing >480 000 cells. Cytocipher is a novel and generalizable method that statistically determines transcriptionally distinct and programmatically reproducible clusters from single-cell data.

Availability and implementation: The software version used for this manuscript has been deposited on Zenodo (https://doi.org/10.5281/zenodo.8089546), and is also available via GitHub (https://github.com/BradBalderson/Cytocipher).


Conflict of interest statement

None declared.

Figures

**Figure 1.**
Overview of *Cytocipher* for cluster-specific ‘code-scoring’ and significant cluster analysis with ‘cluster-merge’. (A) Schematic illustration of the ‘Cytocipher code-scoring’ method. Briefly, (1) gene expression and cluster annotation are provided as input; (2) marker genes are determined; (3) genes that are positive versus negative indicators of cluster membership are determined through comparison of cluster marker genes; (4) for a given cluster, cells co-expressing the negative gene set/s score 0, while the remaining cells are scored for positive gene set co-expression; and (5) repetition of 1–4 for each cluster yields a diagnostic heatmap scoring each cell for membership of each cluster. (B) Schematic illustration of ‘Cytocipher cluster-merge’, which tests for significant differences between cluster pairs, merging those that are mutually non-significantly different. Briefly, (1) per-cell enrichment scores are determined using the process in (A); (2) where a large number of clusters are present, mutual nearest neighbours (MNNs) can determine cluster pairs for comparison; (3) cluster scores are compared with a statistical test to determine significant versus non-significant clusters, with non-significant cluster pairs being merged to create new cluster labels; (4) enrichment scores are redetermined based on the new cluster labels as an output diagnostic, and the process can be repeated with the new cluster labels until convergence.
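
As a rough illustration of step (4) in panel (A), the toy sketch below scores cells for one cluster. It is not the Cytocipher implementation: the function name `code_score_sketch`, the detection threshold (expression > 0), and the choice to score positive co-expression as summed expression are all assumptions made for the example.

```python
import numpy as np

def code_score_sketch(expr, pos_genes, neg_gene_sets):
    """Toy per-cell score for a single cluster (hypothetical helper).

    expr          : (n_cells, n_genes) array of non-negative expression values
    pos_genes     : column indices of the cluster's positive marker genes
    neg_gene_sets : list of column-index lists, one list per negative gene set
    """
    expr = np.asarray(expr, dtype=float)
    detected = expr > 0  # assumption: a gene counts as expressed if > 0

    # Assumed positive score: summed expression of the positive genes,
    # kept only for cells in which every positive gene is detected.
    pos_coexpr = detected[:, pos_genes].all(axis=1)
    scores = np.where(pos_coexpr, expr[:, pos_genes].sum(axis=1), 0.0)

    # Cells co-expressing every gene of any negative set score 0.
    for neg in neg_gene_sets:
        if len(neg) > 0:
            scores[detected[:, neg].all(axis=1)] = 0.0
    return scores

# Three cells, three genes; genes 0 and 1 are positive, gene 2 is negative.
toy_expr = [[2, 3, 0],
            [2, 3, 1],
            [2, 0, 0]]
toy_scores = code_score_sketch(toy_expr, pos_genes=[0, 1], neg_gene_sets=[[2]])
```

Repeating such a call once per cluster would give the cell-by-cluster score matrix that the caption describes as a diagnostic heatmap.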

**Figure 2.**
‘Cytocipher code-scoring’ outperforms existing methods for enrichment of cell population marker gene combinations in simulated data. (A) UMAP of splatter simulated scRNA-seq data with eight groups, each of which has a unique combination of gene expression, with single marker genes demarcating a cluster from all other cells only present in a few cases. (B) UMAPs display the log-counts-per-million for each simulated differential gene between groups, illustrating gene combinatorics for cluster definition. (C) Results from performing per-cell-gene enrichment using the marker genes for each group illustrated in (B). Each sub-panel focuses on a particular group of cells, with the enrichment scores for ‘Cytocipher code-scoring’, ‘Cytocipher coexpr-scoring’, ‘Giotto PAGE’ enrichment, and ‘Scanpy-scoring’ shown alongside the highlighted cluster, indicating high specificity with minimal background for the ‘code-scoring’ approach. (D) Enrichment score prediction metrics for predicting each group based on the enrichment scores for each method. To illustrate the effect of the negative gene set subtraction utilized by ‘Cytocipher code-scoring’, we also show performances for this approach applied to ‘Giotto PAGE’ scores [Giotto (−)] and ‘Scanpy-scoring’ [Scanpy (−)]. Positive scores for each enrichment method were used to indicate cluster membership. Accuracy, F1 score, precision, and recall measures are shown as violin plots, with each point in the violin representing the measure for a given group of cells, and separate violins indicating the enrichment method used for scoring. Tables above each violin plot summarize the average score for each method (μ). (E) ROC curve for testing differences between artificial sub-clusters of each simulated group using ‘Cytocipher cluster-merge’ and each scoring method.
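
The prediction metrics in panel (D) can be sketched as follows, treating a positive enrichment score as a predicted member of the group. The helper name `membership_metrics` and the toy inputs are hypothetical; only the rule "positive score indicates membership" and the four metric names come from the caption.

```python
import numpy as np

def membership_metrics(scores, is_member):
    """Accuracy, precision, recall and F1 for one group (hypothetical helper).

    scores    : per-cell enrichment scores for the group
    is_member : per-cell booleans, True if the cell belongs to the group
    """
    pred = np.asarray(scores, dtype=float) > 0  # positive score => predicted member
    truth = np.asarray(is_member, dtype=bool)

    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / truth.size
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Four cells: two true members, one false positive, one true negative.
toy_metrics = membership_metrics([1.5, 0.0, 2.0, 0.3],
                                 [True, False, False, True])
```

Computing this once per group and per scoring method would give the points that make up each violin.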

**Figure 3.**
‘Cytocipher code-scoring’ performs comparably to existing methods for enrichment of cell population marker gene combinations in hypothalamus neuronal subtypes. (A) UMAP of E18.5 hypothalamus scRNA-seq depicting 79 neuronal subtype clusters. (B) Heatmap displaying per-cell enrichment scores for cluster membership depicted in (A); each row is a cell, and each column is a neuronal subtype cluster. Cells and clusters are ordered such that perfect correspondence of cells to score for their respective cluster lies on the diagonal of the heatmap. Hence, scoring outside of the diagonal indicates ‘cross-scoring’, where cells also score for gene expression outside of their cluster membership. Lack of scoring along the diagonal indicates cell gene expression does not match cluster membership. (C) Enrichment score prediction metrics for predicting each neuronal subtype cluster based on the enrichment scores for each method. Positive scores for each enrichment method were used to indicate cluster membership. Accuracy, F1 score, precision, and recall measures are shown as violin plots, with each point in the violin representing the measure for a given neuronal subtype, and separate violins indicating the enrichment method used for scoring. Tables above each violin plot summarize the average score for each method (μ). (D) ROC curve for testing differences between artificial sub-clusters of each neuronal subtype using ‘Cytocipher cluster-merge’ and each scoring method.

**Figure 4.**
Cross-cluster comparisons of code-scores can be used to merge over-clustered single-cell data, revealing novel heterogeneity in human PBMCs. (A) UMAP of Human 3K PBMC scRNA-seq, clustered at Leiden resolution 0.7 to create eight clusters. Cells are annotated by cell type: B cells, CD4+ T cells, CD8+ T cells, NK cells, Dendritic cells, CD14+ Monocytes, FCGR3A+ Monocytes, and Megakaryocytes. (B) Over-clustering of the PBMC data at Leiden resolution 2.0, producing 19 clusters. (C) Heatmap depicting *Cytocipher* code-scores, where each row is a cell and each column is a cluster. Cells and clusters are ordered such that scores along the diagonal indicate scores of cells for their respective cluster. Cross-scoring of cells for different clusters is indicated with boxes, corresponding to the over-clustering introduced in (B). (D) Violin plots show example cluster significance tests. Non-significant and significant cluster pairs are indicated with crosses and ticks, respectively. The y-axes are the code-score, and the four violins within each plot indicate combinations of cells belonging to each cluster scoring for their own cluster and the cluster being compared against. Clusters are significantly different when a significant P-value is observed for either set of cluster scores when comparing cells between clusters. (E) UMAP depicts the PBMCs after merging non-significantly different clusters. New distinct populations are outlined. Small boxes of UMAPs depict the log-cpm expression of genes in the new distinct subpopulations (GZMK and GZMH). (F) The same as (C), except for the new clusters after merging. (G) UMAPs of the data clustered at increasing Leiden resolutions from left to right. Cases of over- and under-clustering are outlined and labelled. The new distinct clusters appear at resolution 1.5 (as outlined and labelled), at the same point where over-clustering is still clearly evident.
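
The decision rule in panel (D) can be sketched as below: for a pair of clusters, each cluster's scores are compared between the cells of the two clusters, and the pair is called different if either comparison is significant. The caption does not name the statistical test, so a simple two-sided permutation test on the difference in means is swapped in here as a stand-in; `perm_pvalue`, `clusters_differ`, and the 0.05 cutoff are assumptions for illustration only.

```python
import numpy as np

def perm_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in means (stand-in test)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(perm[:x.size].mean() - perm[x.size:].mean()) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)

def clusters_differ(scores_a, scores_b, labels, a, b, cutoff=0.05):
    """True if clusters a and b differ on either cluster's scores."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    labels = np.asarray(labels)
    in_a, in_b = labels == a, labels == b
    p_a = perm_pvalue(scores_a[in_a], scores_a[in_b])  # scores for cluster a
    p_b = perm_pvalue(scores_b[in_a], scores_b[in_b])  # scores for cluster b
    return min(p_a, p_b) < cutoff

toy_labels = ["a"] * 10 + ["b"] * 10
flat = [1.0] * 20                    # identical scores in both clusters
split = [5.0] * 10 + [0.0] * 10      # cluster "a" scores high, "b" scores zero
distinct_pair = clusters_differ(split, flat, toy_labels, "a", "b")
merge_pair = not clusters_differ(flat, flat, toy_labels, "a", "b")
```

Pairs for which `clusters_differ` returns False would be candidates for merging, and scoring and testing could then be repeated on the merged labels.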

**Figure 5.**
*Cytocipher* identifies intermediate cell states in mouse pancreas development. (A) UMAP of mouse E15.5 pancreas cells, with cells annotated by broad cell types: ductal, Ngn3 low EP cells, Ngn3 high EP cells, pre-endocrine, epsilon, beta, alpha, and delta. (B) UMAP with cells over-clustered at Leiden resolution 3.5, producing 30 clusters. (C) Heatmap of *Cytocipher* code-scores per cluster, as shown for clusters in panel (B). (D) UMAP of 16 significant clusters determined from ‘Cytocipher cluster-merge’ applied to the clusters depicted in panel (B). Novel intermediate states are outlined. (E) Top marker genes determined for the clusters depicted in (D), with cells belonging to the relevant clusters of the marker genes outlined (see text for interpretation). (F) Heatmap of *Cytocipher* code-scores per cell (row) and cluster (column), for merged clusters depicted in panel (D).

**Figure 6.**
*Cytocipher* detects over-represented subpopulations in prostate cancer. (A) UMAP of scRNA-seq from normal and tumour tissue from the human prostate, consisting of 15 492 cells. Data as provided by Tuong *et al.* (2021). (B) *Cytocipher* significant clusters after merging 47 clusters produced from Leiden clustering at resolution 4.0 (depicted as inner UMAP). (C) Heatmap depicting *Cytocipher* code-scores, where each row is a cell and each column is a cluster. Cells and clusters are ordered such that scores along the diagonal indicate scores of cells for their respective cluster. Code-scores for the 29 significant clusters in panel (B) are shown. (D) Inner volcano plot depicts −log10(p-adjusted) on the y-axis and log-fold change on the x-axis, testing for differential abundance of cells (DA) on the cell–cell neighbourhood graph using ‘Milo’. The outer UMAP depicts non-significant cells in the background, and significantly DA cellular neighbourhoods are highlighted. (E) UMAP highlights the significant clusters detected by *Cytocipher* that were independently determined as over-represented in prostate cancer by ‘Milo’ DA analysis. (F) Violin plots with −log10(p-adjusted) on the y-axis and *Cytocipher* significant clusters on the x-axis. A horizontal line indicates the p-adjusted cutoff of 0.05; cells above this line are considered significantly DA between normal and tumour samples. (G) Equivalent to (F), except for the original prostate cell types depicted in (A). (H) *Cytocipher* code-scores for the prostate cancer over-represented clusters detected by *Cytocipher*; from left to right, the code-scores depicted are specific to Clusters 25, 6, 0, and 9. The marker gene set for each respective cluster is shown within the UMAP plots. (I) Equivalent to (H), except depicting Clusters 17, 16, 12, 2, and 8. These clusters also score for Cluster 9 as depicted in (H), but are significantly different due to additional gene co-expression.

**Figure 7.**
*Cytocipher* scales to >480 000 cells with high test performance. (A) Tabula Sapiens UMAP depicting 483 152 cells sampled from 24 tissues by the Tabula Sapiens Consortium (2022). (B) Heatmap depicting *Cytocipher* code-scores for the 177 cell types annotated within the Tabula Sapiens dataset. Each row is a cell and each column is a cluster. Cells and clusters are ordered such that scores along the diagonal indicate scores of cells for their respective cluster. (C) Sankey diagram indicating the top four over-merged clusters by *Cytocipher* when testing with artificial random sub-groups of the 177 cell types. The left side of the diagram indicates the sub-grouped cell types, and the right side indicates the sub-grouped cell types merged by *Cytocipher*. (D) ROC curve depicting the true-positive rate on the y-axis and the false-positive rate on the x-axis at different P-value cutoffs using ‘Cytocipher cluster-merge’ applied to artificial random subgroups of the 177 cell types using either ‘code-scoring’, ‘coexpr-scoring’, ‘Scanpy-scoring’, or ‘Scanpy-scoring’ with negative gene set subtraction [Scanpy (−)]. AUC for each scoring method is indicated in the legend. (E) Bar charts indicate time and memory usage of *Cytocipher* when analysing the 483 152 cells across artificial cell type sub-groups. ‘Giotto PAGE’ could not be performed on the full dataset due to memory limitations. (F and G) Equivalent to (D) and (E), except downsampling each of the cell type subgroups to a maximum of 15 cells to reduce the dataset size to 7385 cells, enabling ‘Giotto PAGE’ to be run for comparison and allowing the effect of fewer cells on test performance to be examined. (H) Small comparison between *Cytocipher* and ‘Sc-SHC’, using 500 cells and 12 random cell types subsetted from the 177 cell types. Ticks on the right-hand side of the Sankey diagrams indicate artificial over-clusters were correctly merged, while crosses indicate incorrect merging. (I) Bar plot indicating run-time for the different methods, with methods depicted on the y-axis and run-time on the x-axis.
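
The downsampling described for panels (F and G), where each cell type subgroup is capped at a maximum of 15 cells, might look like the sketch below. The function name `downsample_per_group`, the random seed, and the use of uniform sampling without replacement are assumptions; the caption only states the cap of 15 cells per subgroup.

```python
import numpy as np

def downsample_per_group(labels, max_cells=15, seed=0):
    """Return sorted cell indices with at most `max_cells` per group label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for group in np.unique(labels):
        idx = np.flatnonzero(labels == group)
        if idx.size > max_cells:
            # Assumed strategy: uniform random sampling without replacement.
            idx = rng.choice(idx, size=max_cells, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

# Toy labels: one large subgroup (40 cells) and one small subgroup (5 cells).
toy_groups = np.array(["x"] * 40 + ["y"] * 5)
kept = downsample_per_group(toy_groups, max_cells=15)
kept_groups = toy_groups[kept]
```

Subgroups already at or below the cap are kept whole, so only the larger subgroups shrink.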


Cited by

CHOIR improves significance-based detection of cell types and states from single-cell data.
Sant C, Mucke L, Corces MR. bioRxiv [Preprint]. 2025 Feb 19:2024.01.18.576317. doi: 10.1101/2024.01.18.576317. Update in: Nat Genet. 2025 May;57(5):1309-1319. doi: 10.1038/s41588-025-02148-8. PMID: 38328105
Systematic analysis of the transcriptional landscape of melanoma reveals drug-target expression plasticity.
Balderson B, Fane M, Harvey TJ, Piper M, Smith A, Bodén M. Brief Funct Genomics. 2025 Jan 15;24:elad055. doi: 10.1093/bfgp/elad055. PMID: 38183207
CHOIR improves significance-based detection of cell types and states from single-cell data.
Sant C, Mucke L, Corces MR. Nat Genet. 2025 May;57(5):1309-1319. doi: 10.1038/s41588-025-02148-8. Epub 2025 Apr 7. PMID: 40195561

