Nat Comput Sci. 2024 Dec;4(12):955-977.
doi: 10.1038/s43588-024-00734-0. Epub 2024 Dec 20.

Mapping the gene space at single-cell resolution with gene signal pattern analysis

Aarthi Venkat et al. Nat Comput Sci. 2024 Dec.

Abstract

In single-cell sequencing analysis, several computational methods have been developed to map the cellular state space, but little has been done to map or create embeddings of the gene space. Here we formulate the gene embedding problem, design tasks with simulated single-cell data to evaluate representations, and establish ten relevant baselines. We then present a graph signal processing approach, called gene signal pattern analysis (GSPA), that learns rich gene representations from single-cell data using a dictionary of diffusion wavelets on the cell-cell graph. GSPA enables characterization of genes based on their patterning and localization on the cellular manifold. We motivate and demonstrate the efficacy of GSPA as a framework for diverse biological tasks, such as capturing gene co-expression modules, condition-specific enrichment and perturbation-specific gene-gene interactions. Then we showcase the broad utility of gene representations derived from GSPA, including for cell-cell communication (GSPA-LR), spatial transcriptomics (GSPA-multimodal) and patient response (GSPA-Pt) analysis.
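As a rough illustration of the idea of a diffusion wavelet dictionary on a cell-cell graph, the following minimal sketch builds wavelets on a toy six-cell path graph and projects gene signals onto them. It rests on assumptions not stated in the abstract: a lazy random-walk diffusion operator P, the common dyadic construction Ψ_0 = I − P and Ψ_j = P^(2^(j−1)) − P^(2^j), and a plain matrix-vector projection. All function names are hypothetical, and the QR factorization and autoencoder steps are omitted.

```python
def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Apply matrix m to column vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def subtract(a, b):
    """Elementwise difference of two matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def diffusion_operator(adj):
    """Lazy random walk P = (I + D^-1 A) / 2; every row sums to 1 (assumed form)."""
    n = len(adj)
    p = []
    for i, row in enumerate(adj):
        degree = sum(row)
        p.append([0.5 * (1.0 if i == j else 0.0) + 0.5 * row[j] / degree
                  for j in range(n)])
    return p


def wavelet_dictionary(adj, num_scales):
    """Return [Psi_0, ..., Psi_J] using the assumed dyadic construction."""
    n = len(adj)
    p = diffusion_operator(adj)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    wavelets = [subtract(eye, p)]       # Psi_0 = I - P
    power = p                           # P^(2^0)
    for _ in range(num_scales):
        next_power = matmul(power, power)
        wavelets.append(subtract(power, next_power))  # P^(2^(j-1)) - P^(2^j)
        power = next_power
    return wavelets


def project(signal, wavelets):
    """Concatenate the wavelet coefficients of one gene signal over all scales."""
    features = []
    for psi in wavelets:
        features.extend(matvec(psi, signal))
    return features


# Toy cell-cell graph: six cells connected in a path (illustrative only).
adjacency = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(6)] for i in range(6)]
dictionary = wavelet_dictionary(adjacency, num_scales=2)

uniform_gene = [1.0] * 6                           # expressed equally everywhere
localized_gene = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # expressed in a single cell

uniform_features = project(uniform_gene, dictionary)
localized_features = project(localized_gene, dictionary)
```

Under this construction a uniformly expressed gene has wavelet coefficients of zero at every scale, because each row of P sums to 1, while a localized gene does not. This is one way to see why such features can separate genes by their patterning on the graph.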

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Extended Data Fig. 1 | Batch effect robustness in GSPA.
a. Dataset simulated with 3 clusters and 2 batches with batch effect. Gene embeddings colored by ground truth cluster association (differential expression factor) and batch effect association (batch effect factor) show separation by both. b. Dataset with batch effect corrected. Gene embeddings separate by cluster, but not batch effect. c. GSPA cell type association score correctly identifies relationship between genes and each cluster.
Extended Data Fig. 2 | Overview of Gene Signal Pattern Analysis Comparisons.
Comparison names, methodology in text and diagram, and use of cell-cell graph based on shared properties of comparison.
Extended Data Fig. 3 | Coexpression preservation in two-branch, three-branch single-cell simulations.
a. Experimental setup. b. Simulated dataset with two branches schematic. PHATE embedding of cells from noiseless simulation and noisy simulation, colored by pseudotime. Spearman correlation evaluating performance for all comparisons across 3 runs. c. Simulated dataset with three branches schematic. PHATE embedding of cells from noiseless simulation and noisy simulation, colored by pseudotime. Spearman correlation evaluating performance for all comparisons across 3 runs. d. PHATE embedding of genes from two branch simulation, colored by gene module assignments. Cells colored by gene module enrichment score. e. PHATE embedding of genes from three branch simulation, colored by gene module assignments. Cells colored by gene module enrichment score.
Extended Data Fig. 4 | Transformation and graph construction robustness in GSPA.
a. Schematic of grid search of 2 transformations, 4 kNN choices, 3 kernels, and 2 replicates (48 runs total). b. Coexpression and localization experiment performance across all runs. c. Comparison of performance rank of methods that use cell-cell graph versus without cell-cell graph.
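The run count in panel a follows from multiplying the grid dimensions: 2 × 4 × 3 × 2 = 48. The short sketch below enumerates such a grid. The option labels are hypothetical placeholders, since the caption does not name the transformations, kNN values or kernels.

```python
from itertools import product

# Placeholder labels only; the caption gives the counts, not the actual options.
transformations = ["transformation_1", "transformation_2"]
knn_choices = ["knn_choice_1", "knn_choice_2", "knn_choice_3", "knn_choice_4"]
kernels = ["kernel_1", "kernel_2", "kernel_3"]
replicates = [1, 2]

# One run per combination of transformation, kNN choice, kernel and replicate.
runs = list(product(transformations, knn_choices, kernels, replicates))
total_runs = len(runs)
```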
Extended Data Fig. 5 | Schematic of generation of signals for localization experiment.
a. Noisy simulated data with pseudotime. b. Selection of windows of size δ where ground truth localization is 1 − δ. c. Examples of generated signals of different δ.
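The signal generation described in this caption can be sketched as follows. This is a minimal illustration under assumptions not given in the caption: pseudotime is scaled to [0, 1], the signal is a binary indicator of the window, and the window start is drawn uniformly. The function name `make_window_signal` is hypothetical.

```python
import random


def make_window_signal(pseudotime, delta, rng):
    """Return (signal, ground_truth_localization) for one pseudotime window.

    The signal is 1 for cells whose pseudotime falls in a window of width
    delta and 0 elsewhere (assumed form); the ground truth is 1 - delta.
    """
    start = rng.uniform(0.0, 1.0 - delta)
    signal = [1.0 if start <= t <= start + delta else 0.0 for t in pseudotime]
    return signal, 1.0 - delta


rng = random.Random(0)
# 200 cells evenly spaced along a pseudotime axis scaled to [0, 1] (assumption).
pseudotime = [i / 199 for i in range(200)]

results = {}
for delta in (0.1, 0.5, 0.9):
    signal, localization = make_window_signal(pseudotime, delta, rng)
    fraction_expressing = sum(signal) / len(signal)
    results[delta] = (fraction_expressing, localization)
```

Narrow windows give signals expressed in few cells and a ground-truth localization near 1, while wide windows give diffuse signals and a localization near 0, matching the inverse relationship between window size and localization.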
Extended Data Fig. 6 | Differential localization in two-branch, three-branch single-cell simulations.
a. Diagram of generated signals based on pseudotime window and anti-correlation between window size and localization. b. Two branch noisy simulated dataset, visualized with PHATE and colored by pseudotime. Spearman correlation evaluating performance for all comparisons across 3 runs. c. Three branch noisy simulated dataset, visualized with PHATE and colored by pseudotime. Spearman correlation evaluating performance for all comparisons across 3 runs.
Extended Data Fig. 7 | Extended GSPA and comparison analysis for CD8+ T cells.
a. Cells clustered with top DEGs identified. b. Mean expression of top DEGs Rps19 and Rps20 and key CD8+ T cell marker genes per cluster. c. Coexpression networks of top localized genes in each gene cluster. d. Klf2 KO network.
Extended Data Fig. 8 | Cell type association scores for peripheral tolerance model.
a. Gene embedding colored by cell type association ranking. b. Dot plot with top 10 genes associated with each cell type.
Extended Data Fig. 9 | Response trajectories and biomarkers revealed by multiscale GSPA patient manifold.
a. Schematic of GSPA-Pt. b. PHATE visualization of patient embeddings based on GSPA+QR gene embeddings and comparisons. c. AUROC evaluation of response classification (logistic regression). d. Top genes predictive of response and non-response based on highest and lowest logistic regression coefficients. e. Patient embedding colored by percent of total cells annotated as B cells. f. Patient embedding with three samples from patient 1, 3, and 20 highlighted, corresponding to samples obtained from patients over time (pre-therapy baseline, post-therapy-1, and post-therapy-2). Trajectories of samples visualized.
Fig. 1 | Overview of GSPA.
a, Construction of a cell–cell graph, where nodes are cells and edges are affinities between cells based on similarity of transcriptomic measurements. b, Five demonstrative gene signals (triangles), where signals are continuous functions defined on nodes of the cell–cell graph. c, Construction of the diffusion wavelet Ψ_j at scale j and the diffusion wavelet dictionary W, or the QR-factorized dictionary Ŵ, consisting of diffusion wavelets for scales 1, ..., J. Gene signals are projected onto the wavelet dictionary and gene embeddings are learned via an autoencoder architecture. d, Demonstrative gene embedding, where similar gene patterns are embedded closer together in the low-dimensional space and dissimilar gene patterns are embedded far apart. e, Differential localization determines how diffusely expressed gene signals are on a graph, where very diffusely expressed signals do not explain cell–cell variation. Genes a and e are most localized and gene c is least localized. f, Example downstream applications of GSPA, where gene embeddings enable cell-type-independent characterization of gene modules, cell–cell communication, spatial transcriptomics and patient manifolds.
Fig. 2 | Capturing coherent visualization and gene modules, trajectories and archetypes.
a, Noiseless (top) and noisy (bottom) cell embeddings of simulated linear trajectory, colored by ground-truth pseudotime provided by the simulation engine (left) and example gene expression (right). b, Experimental set-up (left) and Spearman correlation (ρ) evaluating performance on task for all comparisons across three runs (right). c, Gene embeddings of GSPA+QR and raw measurements colored by number of cells gene is expressed in. d, Gene modules detected by Leiden clustering (left) for GSPA+QR, and gene module enrichment and expression over time (right). Expression over time presented as mean expression of genes within module ±1 s.d. e, GSPA+QR gene embedding colored by time at which gene peaks. f, GSPA+QR gene embedding with archetypes identified via AAnet (left), with gene enrichment and expression over time visualized for ‘archetypal’ genes (genes closest to each archetype). Expression over time presented as mean expression of genes within module ±1 s.d. g, PBMC cell embedding and gene embedding with key PBMC markers annotated from PanglaoDB (top) and embryoid-body cell embedding and gene embedding (bottom), colored by diffusion eigenvector and key hemangioblast lineage markers annotated from ref. .
Fig. 3 | Differential localization analysis enabled by GSPA.
a, Differential localization diagram. b, Diagram of how localization reveals genes that are most distant from uniform. c, Spearman correlation evaluating performance for all comparisons across three runs. d, Original cell embedding versus cell embedding generated with predicted localized genes or predicted non-localized genes only; correlation between geodesic distances in original cell–cell graph versus feature-selected cell–cell graphs (for 100,000 pairwise distances subsampled twice). Shown for PBMCs (left) and embryoid-body data (right). EB, embryoid body.
Fig. 4 | Gene–gene co-expression in CD8+ T cells during acute and chronic infection.
a, PHATE embedding of antigen-specific CD8+ T cells from six experimental conditions (left) and marker genes visualized (right). b, Gene embedding visualized with PHATE, colored by gene module assignment. c, Gene embedding visualized with PHATE, colored by computed localization score. d, Cell clustering rank versus localization score, with representative genes visualized to demonstrate similarities and differences. e, Gene module enrichment across all cells and per condition (A, acute; C, chronic) and timepoint (days 4, 8 and 40). f, Enrichment of top localized genes enriched in gene module 5 for GSPA+QR (top), and gene set enrichment scores for type 1 interferon gene sets for top genes from all comparisons (bottom). g, kNN graph of gene–gene co-expression relationships that were knocked out in Tbx21 knockout.
Fig. 5 | Cluster-independent LR signal patterns in peripheral tolerance skin model.
a, Schematic of the GSPA-LR pipeline. b, Skin cells from no antigen (NO AG), antigen (AG) and antigen with checkpoint inhibitor (AG CPI) conditions visualized with PHATE. c, Skin cells colored by previously annotated cell types. d, Skin cells colored by CCL5, CCR5, PD-L1 and PD-1. e, Permutation test result from CellPhoneDB. f, LR pair embedding visualized with PHATE. g, Visualization of pairs, ligand and receptor enrichment, and gene set enrichment scores for module 5 (top) and module 19 (bottom). h, Pathway embedding visualized with PHATE.
Fig. 6 | Spatially localized gene signaling and immune hubs in 10x Visium human lymph node.
a, Schematic of GSPA-multimodal using integrated diffusion on spatial transcriptomic data. b, Hematoxylin and eosin stain of human lymph node tissue. c, PHATE visualization of gene embedding, colored by gene module assignment (left) and localization score (right). d, Enrichment of gene modules spatially and visualization of top localized genes. e, Gene embedding with top spatially variable genes, where localization score corresponds with spatial variability (left) for n = 1,969 highly variable genes (one-sided Wilcoxon rank sums test, P = 9.47 × 10^−7). Localized genes that are not significant by SpatialDE reveal stromal subset (right). f, Cell–cell communication networks derived from gene–gene interactions with OmniPathDB.
