BMC Bioinformatics. 2024 Jan 15;25(1):25.
doi: 10.1186/s12859-024-05641-9.

Cellograph: a semi-supervised approach to analyzing multi-condition single-cell RNA-sequencing data using graph neural networks

Jamshaid A Shahir et al. BMC Bioinformatics.

Abstract

With the growing number of single-cell datasets collected under more complex experimental conditions, there is an opportunity to leverage single-cell variability to reveal deeper insights into how cells respond to perturbations. Many existing approaches rely on discretizing the data into clusters for differential gene expression (DGE), effectively ironing out any information unveiled by the single-cell variability across cell types. In addition, DGE often assumes a statistical distribution that, if erroneous, can lead to false-positive differentially expressed genes. Here, we present Cellograph: a semi-supervised framework that uses graph neural networks to quantify the effects of perturbations at single-cell granularity. Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable data visualization and clustering. The learned gene weight matrix from training reveals pertinent genes driving the differences between conditions. We demonstrate the utility of our approach on publicly available datasets including cancer drug therapy, stem cell reprogramming, and organoid differentiation. Cellograph outperforms existing methods for quantifying the effects of experimental perturbations and offers a novel framework to analyze single-cell data using deep learning.

Keywords: Graph neural networks; Semi-supervised learning; Single-cell genomics.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Illustrative overview of the Cellograph algorithm. Single-cell data collected from multiple drug-treated samples (A, B) are converted to a kNN graph (C), in which cells are nodes and edges connect transcriptionally similar cells. The colored rectangles (B) correspond to the samples represented by the drugs in A. This kNN graph is the input to a two-layer GCN (D) that learns, quantitatively and visually, how prototypical each cell is of its experimental label through the learned latent embedding. (E) A mathematical schematic of the first layer, in which each cell’s gene expression and that of its neighbors are aggregated to produce a lower-dimensional representation of the cell in a latent space. (F) A mathematical schematic of the second layer, in which the output embedding of the first layer is mapped to softmax probabilities of cells belonging to each of the drug treatments
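The pipeline in the Fig. 1 caption (kNN graph, a first layer that aggregates each cell's expression with its neighbors' into a latent space, and a second layer that outputs softmax probabilities) can be sketched in a few lines of NumPy. This is only an illustrative forward pass under stated assumptions, not the authors' implementation: the symmetric normalization with self-loops, the ReLU activation, the Euclidean kNN, the layer sizes, and all function names are assumptions, and the weights here are random rather than trained on condition labels.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric kNN adjacency (no self-loops) from Euclidean distances. Assumed metric."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                      # a cell is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :k]              # indices of the k nearest cells
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                         # make the graph undirected

def normalize_adjacency(A):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)              # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """Two-layer graph convolution: latent embedding H, then class probabilities P."""
    H = np.maximum(A_hat @ X @ W0, 0.0)               # layer 1: aggregate neighbors, ReLU
    P = softmax(A_hat @ H @ W1)                       # layer 2: per-condition probabilities
    return H, P

# Toy data: 60 cells, 20 genes, 3 conditions (all sizes are hypothetical).
rng = np.random.default_rng(0)
n_cells, n_genes, n_hidden, n_conditions = 60, 20, 8, 3
X = rng.normal(size=(n_cells, n_genes))
A_hat = normalize_adjacency(knn_adjacency(X, k=5))
W0 = rng.normal(scale=0.1, size=(n_genes, n_hidden))       # gene weight matrix (untrained)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_conditions))
H, P = gcn_forward(A_hat, X, W0, W1)
```

In this sketch, each row of `P` sums to 1 and plays the role of the per-cell condition probabilities, while `H` stands in for the latent embedding used for visualization and clustering. In a semi-supervised setting the weights would be fit by minimizing cross-entropy on the labeled cells, which is omitted here.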
Fig. 2
Cellograph identifies treatment groups and distinguishes genes defining these groups on a human organoid dataset. (A) PHATE projection of the learned latent space, with cells colored by treatment labels, probabilities of belonging to control or KPT-treated cells, clusters obtained by k-means clustering of the learned latent embedding with k = 3, and gene expression of GDF15 and KLK7. (B) Heatmap of the top 25 weighted genes from the parameterized gene weight matrix. (C) Heatmap of differentially expressed genes between clusters derived from Cellograph. (D) Compositional plot of predicted treatment groups from the softmax probabilities (z_ij > 0.5) (left) and cell types annotated by the original study (right), partitioned by clusters
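The thresholding step mentioned for panel D of Fig. 2, where a cell is assigned to a treatment group when its softmax probability exceeds 0.5, together with the per-cluster composition, could look roughly like the following. The function names, the use of -1 for cells with no probability above the threshold, and the exclusion of those cells from the fractions are all assumptions made for illustration.

```python
import numpy as np

def predicted_groups(P, threshold=0.5):
    """Assign cell i to condition j when P[i, j] > threshold; -1 when no entry qualifies."""
    P = np.asarray(P, dtype=float)
    best = P.argmax(axis=1)
    confident = P.max(axis=1) > threshold
    return np.where(confident, best, -1)

def composition_by_cluster(groups, clusters, n_groups):
    """Fraction of assigned cells per predicted group, one row per cluster."""
    groups = np.asarray(groups)
    clusters = np.asarray(clusters)
    cluster_ids = np.unique(clusters)
    counts = np.zeros((cluster_ids.size, n_groups))
    for row, c in enumerate(cluster_ids):
        in_cluster = groups[(clusters == c) & (groups >= 0)]   # skip unassigned cells
        counts[row] = np.bincount(in_cluster, minlength=n_groups)
    totals = counts.sum(axis=1, keepdims=True)
    fractions = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return cluster_ids, fractions

# Hypothetical probabilities for four cells and two conditions.
P_demo = np.array([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5], [0.2, 0.8]])
clusters_demo = np.array([0, 0, 1, 1])
groups_demo = predicted_groups(P_demo)
cluster_ids_demo, fractions_demo = composition_by_cluster(groups_demo, clusters_demo, n_groups=2)
```

Because the comparison is strict, a cell with probabilities exactly 0.5 and 0.5 is left unassigned in this sketch; with more than two conditions, more cells can fall below the threshold for every group.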
Fig. 3
Cellograph defines genetic signatures of distinct drug responses in the drug holiday dataset. (A) PHATE embeddings of the learned latent space colored according to the treatment labels, clusters, and treatment probabilities (day 0 not shown). (B) Heatmap of the top 25 weighted genes from the learned parameterized gene weight matrix. (C) The distribution of treatment probabilities for Day 11 cells partitioned by treatment groups. (D) The distribution of gene expression between clusters 0, 5, 3, and 2 of select differentially expressed genes (INHBA, TUBA1B)
Fig. 4
Cellograph distinguishes the molecular mechanisms of transdifferentiation and dedifferentiation in myogenesis. (A) PHATE embeddings of the learned latent space annotated according to treatment conditions, clusters, and softmax probabilities of all conditions except for MEFs, defining the in-group variation. (B) Heatmap of the top weighted genes from the parameterized gene weight matrix, identifying pertinent genes such as cyclin D1 and CRABP1. (C) Violin plot of softmax probabilities of cells belonging to the MyoD/day 4 treatment group, showing similarities to the MyoD/day 2 population. (D) Violin plots of the top 20 differentially expressed genes between clusters 1 and 8 and clusters 3 and 9, which define the Pax7+ cells and MyoD+FRC/day 8 treated cells, respectively. (E) Compositional plot of predicted cell types partitioned by cluster
Fig. 5
Results of running Milo and CNA on the datasets evaluated. (A) Output of running Milo and CNA on the human organoid dataset. (B) Output of running Milo and CNA on the drug holiday dataset. (C) Output of running Milo and CNA on the myogenesis dataset
Fig. 6
Runtime of Cellograph versus MELD at optimal parameter settings. Cellograph consistently outperforms MELD on each dataset while using fewer computing resources (y-axis is log-scaled)
Fig. 7
Boxplots of NMI values per clustering algorithm. Distributions of 100 independent NMI calculations for each clustering algorithm for all three datasets evaluated, quantifying concordance between the cluster assignments and ground truth labels
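The concordance measure in Fig. 7, normalized mutual information (NMI) between cluster assignments and ground truth labels, can be computed as in the sketch below. The caption does not state which normalization was used; this version divides the mutual information by the arithmetic mean of the two entropies, which is an assumption, and the function names are illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(labels_a, labels_b):
    """Normalized mutual information with arithmetic-mean normalization (assumed)."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    contingency = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(contingency, (a, b), 1.0)               # co-occurrence counts
    joint = contingency / contingency.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0                                    # skip empty cells to avoid log(0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / outer[nz]))
    denom = 0.5 * (entropy(a) + entropy(b))
    return mi / denom if denom > 0 else 1.0           # both labelings trivial: define as 1

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
relabeled = [2, 2, 2, 0, 0, 0, 1, 1, 1]               # same partition, different names
score_same = nmi(truth, relabeled)
score_indep = nmi([0, 0, 1, 1], [0, 1, 0, 1])         # statistically independent labelings
```

NMI is invariant to how cluster labels are named, so `score_same` is 1 up to floating-point error, while the independent labelings give 0. Repeating the clustering with different random seeds and collecting the scores would give a distribution like the one the boxplots summarize.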
