BMC Bioinformatics. 2024 Jan 15;25(1):25.

doi: 10.1186/s12859-024-05641-9.

Cellograph: a semi-supervised approach to analyzing multi-condition single-cell RNA-sequencing data using graph neural networks

Jamshaid A Shahir^{1,2,3}, Natalie Stanley^{2,3,4}, Jeremy E Purvis^{5,6,7,8}

Affiliations

¹ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
² Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
³ Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
⁴ Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
⁵ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. purvisj@email.unc.edu.
⁶ Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. purvisj@email.unc.edu.
⁷ Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. purvisj@email.unc.edu.
⁸ Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. purvisj@email.unc.edu.

PMID: 38221640
PMCID: PMC10788980
DOI: 10.1186/s12859-024-05641-9

Jamshaid A Shahir et al. BMC Bioinformatics. 2024.

Abstract

With the growing number of single-cell datasets collected under more complex experimental conditions, there is an opportunity to leverage single-cell variability to reveal deeper insights into how cells respond to perturbations. Many existing approaches rely on discretizing the data into clusters for differential gene expression (DGE), which effectively irons out the information carried by single-cell variability across cell types. In addition, DGE often assumes a statistical distribution that, if erroneous, can lead to false-positive differentially expressed genes. Here, we present Cellograph: a semi-supervised framework that uses graph neural networks to quantify the effects of perturbations at single-cell granularity. Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable data visualization and clustering. The gene weight matrix learned during training reveals pertinent genes driving the differences between conditions. We demonstrate the utility of our approach on publicly available datasets including cancer drug therapy, stem cell reprogramming, and organoid differentiation. Cellograph outperforms existing methods for quantifying the effects of experimental perturbations and offers a novel framework to analyze single-cell data using deep learning.

Keywords: Graph neural networks; Semi-supervised learning; Single-cell genomics.
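As a rough illustration of the idea in the abstract, the following is a minimal sketch of a two-layer graph convolutional forward pass over a kNN graph of cells. It is not the authors' implementation. It assumes the standard symmetrically normalized propagation rule with self-loops and a ReLU nonlinearity, and it uses random toy data and untrained random weights. All names (`knn_adjacency`, `normalize_adjacency`, `gcn_forward`, `W0`, `W1`) and all sizes are hypothetical.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric kNN adjacency (no self-loops) from Euclidean distances."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a cell is not its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]     # indices of the k nearest cells
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(A, A.T)              # symmetrize the graph

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} (assumed normalization)."""
    A_hat = A + np.eye(A.shape[0])         # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A_norm, X, W0, W1):
    # Layer 1: aggregate each cell's expression with its neighbors', then ReLU.
    H = np.maximum(A_norm @ X @ W0, 0.0)
    # Layer 2: map the latent embedding to per-condition probabilities.
    P = softmax(A_norm @ H @ W1)
    return H, P

# Toy data: hypothetical sizes, random values, untrained weights.
rng = np.random.default_rng(0)
n_cells, n_genes, n_hidden, n_conditions = 60, 20, 8, 3
X = rng.normal(size=(n_cells, n_genes))
A_norm = normalize_adjacency(knn_adjacency(X, k=5))
W0 = rng.normal(scale=0.1, size=(n_genes, n_hidden))
W1 = rng.normal(scale=0.1, size=(n_hidden, n_conditions))
H, P = gcn_forward(A_norm, X, W0, W1)
```

Here `H` stands in for the latent embedding and each row of `P` for a cell's probabilities of belonging to each condition. In a trained model `W0` and `W1` would be fit to the condition labels, and a genes-by-latent matrix like `W0` is the kind of object one might inspect for highly weighted genes.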

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Illustrative overview of the Cellograph algorithm. Single-cell data collected from multiple drug-treated samples (A, B) are converted to a kNN graph (C), where cells are nodes and edges connect transcriptionally similar cells. The colored rectangles (B) correspond to the different samples represented by the drugs in A. This kNN graph is fed as input to a two-layer GCN (D) that learns, quantitatively and visually, how prototypical each cell is of its experimental label through the learned latent embedding. E A mathematical schematic of the first layer, where each cell’s gene expression and its neighbors’ gene expression are aggregated to produce a lower-dimensional representation of the cell in a latent space. F A mathematical schematic of the second layer, where the output embedding of the first layer is mapped to softmax probabilities of cells belonging to each of the drug treatments

**Fig. 2**
Cellograph identifies treatment groups and distinguishes genes defining these groups on a human organoid dataset. A PHATE projection of learned latent space, with cells colored by treatment labels, probabilities of belonging to control or KPT-treated cells, clusters obtained by k-means clustering of the learned latent embedding with $k = 3$, and gene expression of GDF15 and KLK7. B Heatmap of top 25 weighted genes from parameterized gene weight matrix. C Heatmap of differentially expressed genes between clusters derived from Cellograph. D Compositional plot of predicted treatment groups from the softmax probabilities ($z_{ij} > 0.5$) (left) and cell types annotated by the original study (right) partitioned by clusters

**Fig. 3**
Cellograph defines genetic signatures of distinct drug responses in the drug holiday dataset. A PHATE embeddings of the learned latent space colored according to the treatment labels, clusters, and treatment probabilities (day 0 not shown). B Heatmap of top 25 weighted genes from learned parameterized gene weight matrix. C The distribution of treatment probabilities for Day 11 cells partitioned by treatment groups. D The distribution of gene expression between clusters 0, 5, 3, and 2 of select differentially expressed genes (INHBA, TUBA1B)

**Fig. 4**
Cellograph distinguishes the molecular mechanisms of transdifferentiation and dedifferentiation in myogenesis. A PHATE embeddings of learned latent space annotated according to treatment conditions, clusters, and softmax probabilities of all conditions except for MEFs, defining the in-group variation. B Heatmap of top weighted genes from parameterized gene weight matrix, identifying pertinent genes such as cyclin D1 and CRABP1. C Violin plot of softmax probabilities of cells belonging to the MyoD/day 4 treatment group, showing similarities to the MyoD/day 2 population. D Violin plots of top 20 differentially expressed genes between clusters 1 and 8 and clusters 3 and 9, which define the Pax7$^{+}$ cells and MyoD+FRC/day 8 treated cells, respectively. E Compositional plot of predicted cell types partitioned by cluster

**Fig. 5**
Results of running Milo and CNA on the datasets evaluated. A Output of running Milo and CNA on the human organoid dataset. B Output of running Milo and CNA on the drug holiday dataset. C Output of running Milo and CNA on the myogenesis dataset

**Fig. 6**
Runtime of Cellograph versus MELD at optimal parameter settings. Cellograph is consistently faster than MELD on each dataset, while using fewer computing resources (y-axis is log-scaled)

**Fig. 7**
Boxplots of NMI values per clustering algorithm. Distributions of 100 independent NMI calculations for each clustering algorithm for all three datasets evaluated, quantifying concordance between the cluster assignments and ground truth labels
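
The NMI scores summarized in Fig. 7 measure agreement between cluster assignments and ground-truth labels. As a hedged sketch of how such a score can be computed, the block below implements normalized mutual information with NumPy. It assumes the arithmetic-mean normalization, which may differ from the variant used in the paper. The function names and toy labels are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(labels_true, labels_pred):
    """Normalized mutual information, arithmetic-mean normalization."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Re-encode arbitrary labels as consecutive integer codes.
    _, t = np.unique(labels_true, return_inverse=True)
    _, c = np.unique(labels_pred, return_inverse=True)
    # Joint distribution over (true label, cluster) pairs.
    joint = np.zeros((t.max() + 1, c.max() + 1))
    np.add.at(joint, (t, c), 1.0)
    joint /= joint.sum()
    # Product of the marginals, same shape as the joint.
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0                         # skip empty cells to avoid log(0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / outer[nz]))
    denom = 0.5 * (entropy(labels_true) + entropy(labels_pred))
    return mi / denom if denom > 0 else 1.0

# Toy labels: a relabeled perfect clustering and an uninformative one.
truth = [0, 0, 1, 1]
perfect = nmi(truth, [1, 1, 0, 0])
unrelated = nmi(truth, [0, 1, 0, 1])
```

Because NMI ignores how clusters are named, the relabeled clustering scores as a perfect match, while a clustering that is statistically independent of the labels scores zero.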

Cited by

Li S, Hua H, Chen S. Graph neural networks for single-cell omics data: a review of approaches and applications. Brief Bioinform. 2025 Mar 4;26(2):bbaf109. doi: 10.1093/bib/bbaf109. PMID: 40091193.
Chen X, Xu H, Yu S, Hu W, Zhang Z, Wang X, Yuan Y, Wang M, Chen L, Lin X, Hu Y, Cai P. AI-Driven Transcriptome Prediction in Human Pathology: From Molecular Insights to Clinical Applications. Biology (Basel). 2025 Jun 4;14(6):651. doi: 10.3390/biology14060651. PMID: 40563902. Review.
Singh R, Orimi HE, Pedabaliyarasimhuni PKR, Hoesli CA, Chioua M. AI-Driven Quality Monitoring and Control in Stem Cell Cultures: A Comprehensive Review. Biotechnol J. 2025 Aug;20(8):e70100. doi: 10.1002/biot.70100. PMID: 40785233. Review.
Huang Y, Zhong F, Liu L. Uncovering latent biological function associations through gene set embeddings. BMC Bioinformatics. 2025 Mar 24;26(1):90. doi: 10.1186/s12859-025-06100-9. PMID: 40128671.

References

1. Klein A, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz D, Kirschner M. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–1201. doi: 10.1016/j.cell.2015.04.044.
2. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019. doi: 10.15252/msb.20188746.
3. Haghverdi L, Büttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods. 2016. doi: 10.1101/041384.
4. Dann E, Henderson NC, Teichmann SA, Morgan MD, Marioni JC. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat Biotechnol. 2021;40(2):245–253. doi: 10.1038/s41587-021-01033-z.
5. Reshef YA, Rumker L, Kang JB, Nathan A, Korsunsky I, Asgari S, Murray MB, Moody DB, Raychaudhuri S. Co-varying neighborhood analysis identifies cell populations associated with phenotypes of interest from single-cell transcriptomics. Nat Biotechnol. 2021;40(3):355–363. doi: 10.1038/s41587-021-01066-4.
