Nat Commun. 2023 Jul 8;14(1):4050.
doi: 10.1038/s41467-023-39895-3.

Leveraging spatial transcriptomics data to recover cell locations in single-cell RNA-seq with CeLEry

Qihuang Zhang et al. Nat Commun. 2023.

Abstract

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity in health and disease. However, the lack of physical relationships among dissociated cells has limited its applications. To address this issue, we present CeLEry (Cell Location recovEry), a supervised deep learning algorithm that leverages gene expression and spatial location relationships learned from spatial transcriptomics to recover the spatial origins of cells in scRNA-seq. CeLEry has an optional data augmentation procedure via a variational autoencoder, which improves the method's robustness and allows it to overcome noise in scRNA-seq data. We show that CeLEry can infer the spatial origins of cells in scRNA-seq at multiple levels, including 2D location and spatial domain of a cell, while also providing uncertainty estimates for the recovered locations. Our comprehensive benchmarking evaluations on multiple datasets generated from brain and cancer tissues using Visium, MERSCOPE, MERFISH, and Xenium demonstrate that CeLEry can reliably recover the spatial location information for cells using scRNA-seq data.

Conflict of interest statement

K.L. and B.Z. are employees of Biogen Inc. M.L. received research funding from Biogen Inc. unrelated to the current manuscript. The remaining authors declare no competing interests.

Figures

Fig. 1. Workflow of CeLEry.
a CeLEry takes an ST dataset as input for model training and a scRNA-seq dataset as input for cell location prediction. CeLEry has an optional data augmentation step, which generates replicates of the ST data via a variational autoencoder. The generated data are then included in the training data. A deep neural network is trained to learn the relationship between the spot-wise gene expression and location information by minimizing a loss function that is specified according to the specific problem. Then, the trained model is applied to the scRNA-seq data to predict the location of each cell. b The data augmentation procedure consists of an encoding stage and a decoding stage. In the encoding stage, the 2D expression pattern for each gene is summarized into the embeddings of mean and standard error vectors of a multivariate normal distribution. In the decoding stage, new embeddings are generated via the multivariate normal distribution given by the encoding stage. Meanwhile, a clustering algorithm groups the genes into clusters. Finally, the generated embedding and the gene cluster embedding are concatenated and passed to a convolutional neural network, which decodes the concatenated embedding into a 2D matrix with the same dimension as the gene expression input. The resulting 2D matrices serve as replicates of the ST data.
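The sampling and concatenation steps of the decoding stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes independent embedding dimensions (a diagonal covariance), and the function names, embedding sizes, and numeric values are hypothetical.

```python
import random

def sample_embeddings(mean, std, n_replicates, seed=0):
    """Draw embeddings z = mean + std * eps, with eps ~ N(0, 1) per dimension.

    Assumption: dimensions are independent (diagonal covariance).
    """
    rng = random.Random(seed)
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        for _ in range(n_replicates)
    ]

def concat_with_cluster(embedding, cluster_embedding):
    """Join a sampled embedding with a gene-cluster embedding (decoder input)."""
    return list(embedding) + list(cluster_embedding)

# Hypothetical encoder output for one gene: 3-dimensional mean and spread.
mean = [0.5, -1.0, 2.0]
std = [0.1, 0.2, 0.0]  # zero spread keeps the third dimension fixed at its mean

replicates = sample_embeddings(mean, std, n_replicates=2)

# Hypothetical 2-dimensional one-hot embedding of the gene's cluster.
decoder_inputs = [concat_with_cluster(z, [1.0, 0.0]) for z in replicates]
```

Each entry of `decoder_inputs` would then be passed to the decoder network to produce one replicate of the gene's 2D expression map.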
Fig. 2. Evaluation of the data augmentation procedure.
a Examples of six gene clusters obtained from the mouse posterior brain ST dataset, where each row contains four representative genes from a cluster. b, c Examples of 2D gene expression maps for genes RPS20 and CALB1, together with three of their replicates generated from the data augmentation procedure. Source data are provided as a Source Data file.
Fig. 3. Cortical layer recovery for spots in the LIBD human DLPFC data.
a Four scenarios were considered in the evaluation. These scenarios vary in the number of tissue sections in the training data and the source of the test data, representing situations with different degrees of location recovery difficulty. b The overall layer prediction accuracy for Scenario 2 with different numbers of replicates obtained from the data augmentation procedure. c Layerwise prediction accuracies under different scenarios using CeLEry without data augmentation, CeLEry with data augmentation (two replicates), novoSpaRc, spaOTsc, and Tangram. The results for Tangram are missing for Scenarios 3 and 4 because Tangram can only take one tissue section as the training data. L1: layer 1, L2: layer 2, L3: layer 3, L4: layer 4, L5: layer 5, L6: layer 6, WM: white matter. d Overall top-1 and top-2 prediction accuracies for CeLEry, CeLEry with data augmentation, Tangram, spaOTsc, and novoSpaRc, under different scenarios. It is noteworthy that Tangram, spaOTsc, and novoSpaRc are not applicable for Scenarios 3 and 4. e Visualization of the probabilities of assigning each spot to different layers (shown as different rows) using CeLEry without data augmentation (Scenario 2), CeLEry with data augmentation (Scenario 2, 2 replicates), CeLEry with multiple training samples (Scenario 4), novoSpaRc, spaOTsc, and Tangram (Scenario 2). For Tangram, novoSpaRc, and spaOTsc, the probability of predicting a cell to be in a layer is calculated by summing up the probabilities of predicting the cell for all spots belonging to that specific layer. The ground truth cortical layer structure for the test sample is shown on the right. L1–L6 and WM were defined similarly as in Fig. 3c. Source data are provided as a Source Data file.
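The layer-probability aggregation described for Tangram, novoSpaRc, and spaOTsc can be sketched as below, together with a top-k check. The top-k rule used here (the true layer is among the k most probable layers) is the usual convention and is an assumption, since the caption does not define it; the probabilities and layer labels are made up for illustration.

```python
def layer_probabilities(spot_probs, spot_layers):
    """Sum one cell's spot-level mapping probabilities within each layer."""
    totals = {}
    for p, layer in zip(spot_probs, spot_layers):
        totals[layer] = totals.get(layer, 0.0) + p
    return totals

def top_k_correct(layer_probs, true_layer, k):
    """Return True if the true layer is among the k most probable layers.

    Assumption: the standard top-k convention, not confirmed by the source.
    """
    ranked = sorted(layer_probs, key=layer_probs.get, reverse=True)
    return true_layer in ranked[:k]

# Hypothetical mapping probabilities of one cell over five spots,
# and the annotated layer of each spot.
spot_probs = [0.10, 0.30, 0.25, 0.20, 0.15]
spot_layers = ["L1", "L2", "L3", "L3", "WM"]

probs = layer_probabilities(spot_probs, spot_layers)
hit_top1 = top_k_correct(probs, "L2", k=1)  # L3 has the largest summed probability
hit_top2 = top_k_correct(probs, "L2", k=2)  # L2 is the second most probable layer
```

Averaging `hit_top1` or `hit_top2` over all test cells would give an overall top-1 or top-2 accuracy under this convention.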
Fig. 4. Layer recovery for single cells in the AD study.
a Overview of the snRNA-seq data generation and analysis procedure for the AD study. In total, 15 postmortem brains were processed for snRNA-seq and analysis. b Numbers of cells mapped to L1–L6 and WM by CeLEry. L1: layer 1, L2: layer 2, L3: layer 3, L4: layer 4, L5: layer 5, L6: layer 6, WM: white matter. c Boxplots for the distribution of the maximum probability of assigning a cell to a layer (left panel; L1: n = 11,807 cells; L2: n = 15,745 cells; L3: n = 15,949 cells; L4: n = 14,121 cells; L5: n = 13,332 cells; L6: n = 25,107 cells; WM: n = 22,988 cells) and of assigning a cell to a layer by cell types (middle panel; Ast: n = 12,126 cells; End: n = 444 cells; Ex: n = 39,176 cells; In: n = 12,286 cells; Mic: n = 3,982 cells; Oli: n = 44,182 cells; Opc: n = 6,853 cells). In each boxplot, the lower and upper hinges correspond to the first and third quartiles, and the center refers to the median value. The upper (lower) whiskers extend from the hinge to the largest (smallest) value no further (at most) than the 1.5 × interquartile range from the hinge. Data beyond the end of the whiskers are plotted individually. Layers or cell types that have higher mapping certainty will show higher maximum probabilities. Proportions of cells that are mapped to each layer by cell types (right panel). L1–L6 and WM were defined similarly as in Fig. 4b. Ast astrocyte, End endothelial, Ex excitatory neuron, In inhibitory neuron, Mic microglia, Oli oligodendrocyte, Opc oligodendrocyte progenitor cell. d Comparison of the proportions of neuronal cells mapped to each layer by disease status (left: excitatory neurons; middle: inhibitory neurons; right: excitatory and inhibitory neurons combined). Within each cell type, we compared the mapped cell proportions by layer between the A+T− and A−T− groups and between the A+T+ and A−T− groups. The comparison was conducted by a one-sided two-sample Z-test to test the null hypothesis that the diseased group (A+T− or A+T+) has more neuronal cells than the control group (A−T−). L1–L6 and WM were defined similarly as in Fig. 4b. Source data are provided as a Source Data file.
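A one-sided two-sample Z-test on proportions of the kind described in panel d can be sketched as follows. The pooled standard error is an assumption, because the caption does not say how the variance was estimated, and the cell counts are hypothetical rather than taken from the study.

```python
import math
from statistics import NormalDist

def one_sided_two_sample_z(x_disease, n_disease, x_control, n_control):
    """Two-proportion Z-test with a pooled standard error (an assumption).

    Null hypothesis: the diseased-group proportion is at least the control's.
    Returns (z, p), where p is the lower-tail probability P(Z <= z), so a
    small p suggests a lower proportion in the diseased group.
    """
    p_disease = x_disease / n_disease
    p_control = x_control / n_control
    pooled = (x_disease + x_control) / (n_disease + n_control)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_disease + 1 / n_control))
    z = (p_disease - p_control) / se
    return z, NormalDist().cdf(z)

# Hypothetical counts: neurons mapped to one layer out of all neurons per group.
z, p_value = one_sided_two_sample_z(40, 200, 60, 200)

# Equal proportions give z = 0 and a lower-tail probability of 0.5.
z_equal, p_equal = one_sided_two_sample_z(50, 200, 50, 200)
```

With these made-up counts the diseased-group proportion (0.20) is below the control proportion (0.30), so `z` is negative and `p_value` is small.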
Fig. 5. 2D location recovery for spots in the mouse posterior brain data.
a Scatter density plots of pairwise distances of true locations versus the predicted locations when the holdoff rate was 30% after resolution enhancement by TESLA. b Pearson’s correlation of the pairwise distances of predicted locations and the distances of true locations for CeLEry, novoSpaRc, spaOTsc, and Tangram when the holdoff rate was set at 10, 30, and 50%, respectively. c The recovered gene expression map in the test set based on predicted locations when the holdoff rate was 30%. The color shows relative gene expression. d MSEs of the predicted 2D locations when Gaussian noise with different standard deviations was added to the test data for scenarios with 10%, 30%, and 50% holdoff rates, respectively. The red line represents Tangram, and the light and dark blue lines represent CeLEry with two and ten replicates, respectively. Source data are provided as a Source Data file.
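The pairwise-distance correlation used in panels a and b can be sketched as below. The coordinates are made up, and the helper names are hypothetical; the sketch only illustrates that the metric compares relative geometry, so a rotated and rescaled prediction still scores a correlation of 1.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distance for every unordered pair of 2D points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical true locations of four spots.
true_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]

# Hypothetical predictions: the true layout rotated by 90 degrees and
# scaled by 2, i.e. (x, y) -> (-2y, 2x), so all distances are doubled.
pred_xy = [(0.0, 0.0), (0.0, 2.0), (-4.0, 0.0), (-6.0, 6.0)]

r = pearson(pairwise_distances(true_xy), pairwise_distances(pred_xy))
```

Because the metric ignores rotation, translation, and uniform scaling, it complements rather than replaces a direct error measure such as the MSE of predicted coordinates reported in panel d.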
Fig. 6. 2D location recovery for single cells in MERSCOPE mouse brain data.
a Three scenarios with varying degrees of complexity were considered for benchmark evaluations. The color of each cell indicates cluster assignment obtained from unsupervised clustering. b Overlay of the three replicates based on their spatial coordinates. c Barplot of Pearson correlation between true and predicted pairwise distances for all cell pairs. d Scatter density plot comparing true and predicted pairwise distances for all pairs in Scenario 3. Color in the plot indicates the density of cell pairs. e Boxplot of Euclidean distances between true and predicted locations for all cells in the test data (n = 18,342 cells for each method under all scenarios). In each boxplot, the lower and upper hinges correspond to the first and third quartiles, and the center refers to the median value. The upper (lower) whiskers extend from the hinge to the largest (smallest) value no further (at most) than 1.5 × interquartile range from the hinge. Data beyond the end of the whiskers are plotted individually. f Visualization of Euclidean distances between true and predicted locations for all cells in the test data for Scenario 3. g Recovered gene expression map of a randomly selected gene ADRB1, based on the predicted locations by CeLEry in Scenario 3, with color indicating relative gene expression. Source data are provided as a Source Data file.
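The boxplot convention stated in panel e (hinges at the quartiles, whiskers limited to 1.5 × interquartile range, remaining points drawn individually) can be sketched as follows. The quartile interpolation rule is an assumption, since plotting libraries differ on it, and the error values are invented for illustration.

```python
from statistics import quantiles

def box_stats(values):
    """Hinges, whisker ends, and outliers under the 1.5 x IQR rule."""
    # Assumption: quartiles by linear interpolation over the observed range.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme data points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Points beyond the whiskers are plotted individually.
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Hypothetical Euclidean distances between true and predicted locations.
errors = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = box_stats(errors)
```

Here the value 30 lies beyond the upper fence, so the upper whisker ends at 9 and 30 is reported as an individual point.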
Fig. 7. 2D location recovery for single cells in MERFISH mouse brain data.
a Two scenarios with varying degrees of complexity were considered for benchmark evaluations. b Overlay of the two training slices based on their original spatial coordinates (left) and coordinates after manual location matching (right). c Barplot of Pearson correlation between true and predicted pairwise distances for all cell pairs. SpaOTsc and novoSpaRc were not able to handle a large number of cells in Scenario 2. d Scatter density plot comparing true and predicted pairwise distances for all cell pairs in Scenario 2. Color in the plot indicates the density of cell pairs. e Boxplot of Euclidean distances between true and predicted locations for all cells in the test data (n = 18,197 cells for each method under all scenarios). SpaOTsc and novoSpaRc were not able to handle a large number of cells in Scenario 2. In each boxplot, the lower and upper hinges correspond to the first and third quartiles, and the center refers to the median value. The upper (lower) whiskers extend from the hinge to the largest (smallest) value no further (at most) than the 1.5 × interquartile range from the hinge. Data beyond the end of the whiskers are plotted individually. f Visualization of Euclidean distance between the true and predicted locations for each cell in the test data for Scenario 2. g True and recovered gene expression maps of two ligand-receptor pairs, based on the predicted locations by CeLEry, with color indicating relative gene expression for Scenario 2. Source data are provided as a Source Data file.
Fig. 8. Application to recover 2D locations for scRNA-seq data in mouse brain with different spatial references.
a 2D locations for scRNA-seq cells predicted by CeLEry, Tangram, spaOTsc, and novoSpaRc, using Vizgen’s MERSCOPE mouse brain data as the spatial reference. b 2D locations for scRNA-seq cells predicted by CeLEry, Tangram, spaOTsc, and novoSpaRc, using MERFISH mouse brain data generated by Zhang et al. as the spatial reference. Source data are provided as a Source Data file.
Fig. 9. 2D location recovery for single cells in MERSCOPE human liver cancer data.
a Barplot of Pearson correlation between true and predicted pairwise distances for all cell pairs. b Scatter density plot comparing true and predicted pairwise distances for all cell pairs. Color in the plot indicates the density of cell pairs. c Boxplot of Euclidean distances between true and predicted locations for all cells in the test data (n = 19,885 cells). In each boxplot, the lower and upper hinges correspond to the first and third quartiles, and the center refers to the median value. The upper (lower) whiskers extend from the hinge to the largest (smallest) value no further (at most) than the 1.5 × interquartile range from the hinge. Data beyond the end of the whiskers are plotted individually. d Visualization of Euclidean distance between the true and predicted locations for each cell in the test data. e Recovered gene expression map of a randomly selected gene CDKN1B, based on the predicted locations by CeLEry, with color indicating relative gene expression. Source data are provided as a Source Data file.
Fig. 10. 2D location recovery for single cells in 10X Xenium breast cancer data.
a Visualization of Replicate 1 and the three training scenarios investigated with varying spatial resolutions, including Xenium single-cell, artificial Visium spot-level, and enhanced spot-level, along with visualization of the single-cell Xenium replicate 2 test dataset. The color of each cell in the training sample and testing sample indicates cluster assignment obtained from unsupervised clustering. b Side-by-side bar graph of Pearson correlation between true and predicted pairwise distances across all cell pairs for each method and scenario evaluated. c Scatter density plots comparing true and predicted pairwise distances for all cell pairs in Scenario 1. Color in the plot indicates the density of cell pairs. d Side-by-side boxplots of Euclidean distances between true and predicted locations for all cells in the test data for each method and scenario evaluated (Scenario 1: n cells in the test set = 29,770, 8872, 7097, and 8872 for CeLEry, Tangram, SpaOTsc, and novoSpaRc respectively due to the computational capacity of each method; Scenario 2: n cells in the test set = 29,770 for each method; Scenario 3: n cells in the test set = 29,770, 8872, 7097, and 8872 for CeLEry, Tangram, SpaOTsc, and novoSpaRc respectively due to the computational capacity of each method). In each boxplot, the lower and upper hinges correspond to the first and third quartiles, and the center refers to the median value. The upper (lower) whiskers extend from the hinge to the largest (smallest) value no further (at most) than 1.5 × interquartile range from the hinge. Data beyond the end of the whiskers are plotted individually. e Visualization of Euclidean distances between true and predicted locations for all cells in the test data for Scenario 1. f Recovered gene expression map for Scenario 1 for a randomly selected gene, GATA3, with color indicating relative gene expression. Source data are provided as a Source Data file.

