Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Feb;17(2):193-200.
doi: 10.1038/s41592-019-0701-7. Epub 2020 Jan 27.

Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies

Affiliations

Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies

Shiquan Sun et al. Nat Methods. 2020 Feb.

Abstract

Identifying genes that display spatial expression patterns in spatially resolved transcriptomic studies is an important first step toward characterizing the spatial transcriptomic landscape of complex tissues. Here we present a statistical method, SPARK, for identifying spatial expression patterns of genes in data generated from various spatially resolved transcriptomic techniques. SPARK directly models spatial count data through generalized linear spatial models. It relies on recently developed statistical formulas for hypothesis testing, providing effective control of type I errors and yielding high statistical power. With a computationally efficient algorithm, which is based on penalized quasi-likelihood, SPARK is also scalable to datasets with tens of thousands of genes measured on tens of thousands of samples. Analyzing four published spatially resolved transcriptomic datasets using SPARK, we show it can be up to ten times more powerful than existing methods and disclose biological discoveries that otherwise cannot be revealed by existing approaches.

PubMed Disclaimer

Conflict of interest statement

Competing interests

The authors declare that they have no competing interests.

Figures

Figure 1:
Figure 1:. Method schematic of SPARK and simulation results.
(a) Method schematic of SPARK. SPARK examines one gene at a time and models the gene expression measurements on spatial locations using the generalized linear spatial model. To detect whether the gene shows spatial expression pattern, SPARK relies on a series of spatial kernels for pattern recognition and outputs a p-value for each spatial kernel using the Satterthwaite method that enables exact p-value computation. All these p-values from different spatial kernels are subsequently combined into a final SPARK p-value through the Cauchy combination rule. Here,wi is a weight for the combination (set to be 1/10 throughout the text) and tan(.) denotes the tangent function. (b) Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values under the null for the first set of null simulations based on the mouse olfactory bulb data (n = 260 array spots). The p-values are combined across ten simulation replicates, with each replicate containing 10,000 simulated genes. Simulations are performed under moderate noise (τ2=0.35). Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek.E (light salmon) which is the Emark test of Trendsceek, Trendsceek. ρ (yellow-green) which is the Markcorr test of Trendsceek, Trendsceek. γ (light green) which is the Markvario test of Trendsceek, and Trendsceek.V (wheat) which is the Vmark test of Trendsceek. Representative expression pattern for a null gene that does not show a spatial expression pattern is embedded inside the panel. (c) Power plots show the proportion of true positives (y-axis) detected by different methods at a range of false discovery rates (FDR; x-axis) for the first set of alternative simulations based on the mouse olfactory bulb data (n = 260 array spots). Representative genes displaying each of the three spatial expression patterns I-III are embedded inside the panels. The proportion of true positives is averaged across ten simulation replicates, with each replicate containing 1,000 SE genes and 9,000 non-SE genes. Simulations are performed under moderate noise (τ2=0.35) and moderate SE strength (threefold). Trendsceek (sky-blue) is the combined test of Trendsceek. (d) Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values under the null for the second set of null simulations based on the seqFISH data. p-values are combined across ten simulation replicates, with each replicate containing 1,000 simulated genes. Simulations are performed under moderate sample size (n = 200 cells). Representative expression pattern for a null gene that does not show a spatial expression pattern is embedded inside the panel. (e) Power plots show the proportion of true positives (y-axis) detected by different methods at a range of false discovery rates (FDR; x-axis) for the second set of alternative simulations based on the seqFISH data with moderate sample size (n = 200 cells). Representative genes displaying each of the three spatial expression patterns are embedded inside the panels. The proportion of true positives is averaged across ten simulation replicates, with each replicate containing 100 SE genes and 900 non-SE genes. Simulations were performed under moderate fraction of marked cells (20%) and moderate SE strength (2-fold) for the hotspot and streak patterns, or under moderate SE strength (40% cells displaying expression gradient) for the linear gradient pattern.
Figure 2:
Figure 2:. Analyzing the mouse olfactory bulb data (n = 260 spots).
(a) Quantile-quantile plot of the observed -log10 p-values from different methods are plotted against the expected -log10 p-values under the null in the permuted data. p-values are combined across ten permutation replicates. Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek.E (light salmon) which is the Emark test of Trendsceek, Trendsceek. ρ (yellow-green) which is the Markcorr test of Trendsceek, Trendsceek. γ (light green) which is the Markvario test of Trendsceek, and Trendsceek.V (wheat) which is the Vmark test of Trendsceek. (b) Power plot shows the number of genes with spatial expression pattern (y-axis) identified by different methods at a range of false discovery rates (x-axis). Trendsceek (sky-blue) which is the combined test of Trendsceek, detected almost none. (c) In situ hybridization of three representative genes (Reln, Cldn5, and Camk2a) obtained from the database of the Allen Brain Atlas. Reln is spatially expressed in the mitral layer and glomeruli layer. Cldn5 is spatially expressed in the nerve layer. Camk2a is spatially expressed in the granular layer. (d) Spatial expression pattern for representative genes in the spatial transcriptomics data. Top row shows the same three genes as shown in (c) along with their p-values from SPARK (inside parenthesis). These genes are only identified by SPARK, but not by the other two methods. Bottom row shows spatial expression patterns for three additional known marker genes (Doc2g, Kctd12, and Penk) for different layers in the mouse olfactory bulb: Doc2g for mitral layer; Kctd12 for nerve layer; and Penk for granular layer. Color represents relative gene expression level (purple: high; green: low). (e) Three distinct spatial expression patterns summarized based on the 772 SE genes identified by SPARK, along with dendrogram displaying the clustering of these three main patterns (pattern I: 119 genes; pattern II: 270 genes; pattern III: 321 genes). (f) Venn diagram shows the overlap between SE genes identified by SPARK and SpatialDE based on an FDR cutoff 0.05. (g) Bar plot shows the percentage of SE genes identified by SPARK (orange/pink) or SpatialDE (orange/purple) that are also validated in two gene lists, one from a literature (left) and the other from the Harmonizome database (right). The orange bar represents the percentage of SE genes identified by both SPARK and SpatialDE that are in either of the gene lists; the pink bar represents the percentage of unique SE genes identified by SPARK that are in either of the gene lists; the purple bar represents the percentage of unique SE genes identified by SpatialDE that are in either of the gene lists. (h) Bubble plot shows -log10 p-values for pathway enrichment analysis on 772 SE genes obtained by SPARK based on an FDR cutoff of 0.05. The p-values are from the clusterProfiler analysis and the dash line indicates a p-value cutoff of 0.05. Gene sets are colored by three categories: GO biological process (blue), GO molecular function (purple), and GO cellular component (yellow).
Figure 3:
Figure 3:. Analyzing the human breast cancer data (n = 250 spots)
(a) Quantile-quantile plot of the observed -log10 p-values from different methods are plotted against the expected -log10 p-values under the null in the permuted data. p-values are combined across ten permutation replicates. Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek.E (light salmon) which is the Emark test of Trendsceek, Trendsceek. ρ (yellow-green) which is the Markcorr test of Trendsceek, Trendsceek. γ (light green) which is the Markvario test of Trendsceek, and Trendsceek.V (wheat) which is the Vmark test of Trendsceek. (b) Power plot shows the number of genes with spatial expression pattern (y-axis) identified by different methods at a range of false discovery rates. Trendsceek (sky-blue) which is the combined test of Trendsceek, detected only a few. (c) Bar plot shows the percentage of SE genes identified by SPARK (orange/pink) or SpatialDE (orange/purple) that are also validated in two gene lists, one from the CancerMine database (left) and the other from the Harmonizome database (right). The orange bar represents the percentage of SE genes identified by both SPARK and SpatialDE that are in either of the gene lists; the pink bar represents the percentage of unique SE genes identified by SPARK that are in either of the gene lists; the purple bar represents the percentage of unique SE genes identified by SpatialDE that are in either of the gene lists. (d) Venn diagram shows the overlap between SE genes identified by SPARK and SpatialDE based on an FDR cutoff 0.05. (e) Spatial expression pattern for five genes (HLA-B, EEF1A1, ERBB2, MMP14, and CD44) that are only identified by SPARK but not by the other two methods. The p-values for the five genes from SPARK are shown inside parenthesis. Color represents relative gene expression level (purple: high; green: low). For reference, the hematoxylin and eosin (H&E) staining of a breast cancer biopsy from ref. is shown in the top left panel. The dark staining in the H&E panel represents potential tumors. These five genes are previously known molecular markers associated with tumor induced immune response (HLA-B), growth factor (ERBB2), or metastasis (EEF1A1, MMP14 and CD44). Panel e is reproduced with permission from ref.. (f) Bubble plot shows -log10 p-values for pathway enrichment analysis on 290 SE genes obtained by SPARK based on an FDR cutoff of 0.05. The p-values are from the clusterProfiler analysis and the dash line indicates a p-value cutoff of 0.05 The dash line indicates a p-value cutoff of 0.05. Gene sets are colored by categories: GO biological process (blue), GO molecular function (purple), and GO cellular component (yellow).
Figure 4:
Figure 4:. Analyzing the mouse hypothalamus data (n = 4,975 cells).
(a) Quantile-quantile plot of the observed -log10 p-values from different methods are plotted against the expected -log10 p-values under the null in the permuted data. The p-values are combined across one hundred permutation replicates. Compared methods include SPARK (pink) and SpatialDE (purple). Results for Trendsceek (sky-blue) are not included here due to its heavy computational burden. (b) Power plot shows the number of genes with spatial expression pattern (y-axis) identified by different methods vs the number of blank control genes identified at the same threshold. (c) Spatial distribution of all major cell classes on the 1.8-mm by 1.8-mm imaged slice from a single female mouse. Cells are colored by cell classes shown in the legend, where the cell class information are obtained from the reference. (d) Spatial distribution of four main cell classes. The spatial distribution of the remaining five cell classes are shown in Supplementary Fig. 8. The cell classes are represented by colored dots while all other background cells are shown as gray dots. Spatial expression pattern for four representative genes (Gad1, Mbp, Cd24a, and Myh11) that are identified by SPARK are shown. The p-values for the four genes from SPARK are shown inside parenthesis. Color represents relative gene expression level (purple: high; green: low).

References

    1. Chen KH, Boettiger AN, Moffitt JR, Wang S & Zhuang X RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015). - PMC - PubMed
    1. Lubeck E, Coskun AF, Zhiyentayev T, Ahmad M & Cai L Single-cell in situ RNA profiling by sequential hybridization. Nat Methods 11, 360–361 (2014). - PMC - PubMed
    1. Femino AM, Fogarty K, Lifshitz LM, Carrington W & Singer RH Visualization of single molecules of mRNA in situ. Method Enzymol 361, 245–304 (2003). - PubMed
    1. Lovatt D et al. Transcriptome in vivo analysis (TIVA) of spatially defined single cells in live tissue. Nat Methods 11, 190–196 (2014). - PMC - PubMed
    1. Simone NL, Bonner RF, Gillespie JW, Emmert-Buck MR & Liotta LA Laser-capture microdissection: opening the microscopic frontier to molecular analysis. Trends In Genetics 14, 272–276 (1998). - PubMed

Publication types