This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Aug 19:2023.07.20.549945.

doi: 10.1101/2023.07.20.549945.

Voyager: exploratory single-cell genomics data analysis with geospatial statistics

Lambda Moses¹, Pétur Helgi Einarsson², Kayla Jackson¹, Laura Luebbert¹, A Sina Booeshaghi¹, Sindri Antonsson², Nicolas Bray³, Páll Melsted², Lior Pachter¹,⁴

Affiliations

¹ Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.
² Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, Reykjavík, Iceland.
³ Boston, MA.
⁴ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA.

PMID: 37645732
PMCID: PMC10461913
DOI: 10.1101/2023.07.20.549945

Lambda Moses et al. bioRxiv. 2023.

Abstract

Exploratory spatial data analysis (ESDA) can be a powerful approach to understanding single-cell genomics datasets, but it is not yet part of standard data analysis workflows. In particular, geospatial analyses, which have been developed and refined for decades, have yet to be fully adapted and applied to spatial single-cell analysis. We introduce the Voyager platform, which systematically brings the geospatial ESDA tradition to (spatial) -omics, with local, bivariate, and multivariate spatial methods not yet commonly applied to spatial -omics, united by a uniform user interface. Using Voyager, we showcase biological insights that can be derived with its methods, such as biologically relevant negative spatial autocorrelation. Underlying Voyager is the SpatialFeatureExperiment data structure, which combines Simple Feature with SingleCellExperiment and AnnData to represent and operate on geometries bundled with gene expression data. Voyager has comprehensive tutorials demonstrating ESDA built on GitHub Actions to ensure reproducibility and scalability, using data from popular commercial technologies. Voyager is implemented in both R/Bioconductor and Python/PyPI, and features compatibility tests to ensure that both implementations return consistent results.
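The spatial autocorrelation mentioned in the abstract can be made concrete with global Moran's I, one of the statistics that Voyager brings over from geospatial analysis. The following is a minimal plain-Python sketch, not the API of Voyager or VoyagerPy: the function names `morans_i` and `rook_weights`, the 4×4 grid, and the binary rook-contiguity weights are all assumptions chosen for illustration. A checkerboard pattern yields negative autocorrelation, while a smooth gradient yields positive autocorrelation.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a dense n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


def rook_weights(nrow, ncol):
    """Binary weights: cells sharing an edge on a regular grid are neighbors.

    Hypothetical helper for this toy example only.
    """
    n = nrow * ncol
    w = [[0.0] * n for _ in range(n)]
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    w[i][rr * ncol + cc] = 1.0
    return w


w = rook_weights(4, 4)

# Alternating 0/1 values: every neighbor pair differs.
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]
# Values increase from left to right: neighbors are similar.
gradient = [c for r in range(4) for c in range(4)]

i_checker = morans_i(checkerboard, w)
i_gradient = morans_i(gradient, w)
print(i_checker, i_gradient)
```

With these toy inputs, `i_checker` is -1 because every neighboring pair deviates from the mean in opposite directions, and `i_gradient` is about 0.67.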

Figures

**Figure 1:**
Schematic overview of the Voyager framework. Voyager brings exploratory spatial data analysis (ESDA) methods initially developed for geospatial data to spatial -omics, with a consistent user interface for different methods. Voyager is based on the SpatialFeatureExperiment (SFE) object. In R, SFE uses sf and terra to extend SingleCellExperiment (SCE) and SpatialExperiment (SPE). In Python, SFE extends AnnData with GeoPandas. Voyager implements plotting functions for gene expression, cell attributes, and spatial analysis results. Spatial results shown in this schematic are local Moran’s I (left), correlogram (center top), Moran scatter plot (center bottom), and variogram map (right). The documentation website includes tutorials that demonstrate ESDA on data from multiple spatial -omics technologies, including Visium, Slide-seq, Xenium, CosMX, MERFISH, seqFISH, and CODEX. The website is built automatically with GitHub Actions for reproducibility, and Google Colab notebooks are automatically generated from the vignettes. Compatibility tests are used to make sure that the R and Python implementations return consistent results for core functionalities.

**Figure 2:**
Applications of Voyager on spatial transcriptomics datasets. A) In a mouse skeletal muscle dataset, the total UMI counts, or library size per spot (nCounts), are plotted in space as blue open circles and myofibers are colored in red according to their cross section areas. Only spots that intersect tissue are plotted. The H&E image is plotted on the side as a reference. B) Scatter plot of the number of genes detected per spot (nGenes) vs. nCounts, colored by mean area of myofibers that intersect each spot. C) Simulated (density plot) and observed (vertical line) difference between Moran’s I in nCounts of spots that intersect tissue (in) and that of spots that don’t (out). D) The 20 most positive and 20 most negative eigenvalues from MULTISPATI PCA of a mouse liver MERFISH dataset. As other eigenvalues were not computed, there is a break after PC20 in this plot. E) The most positive and negative gene loadings for PCs 1, 2, and 40. F) A subset of the MERFISH data showing a portal triad (near top right) and two central veins (left and bottom right), with cell polygons colored by their projections into 2 PCs with the most positive eigenvalues and the PC with the most negative eigenvalue (“PC40”). The first 2 PCs show zonation.

**Figure 3:**
Application of neighborhood view spatial statistics on non-spatial scRNA-seq. A) Violin plots of log-normalized counts of the top marker gene of each Leiden cluster in the PBMC dataset. B) Moran scatter plot of nCounts in a 10X Chromium human PBMC dataset. The spatial lags were computed with the k-nearest-neighbor graph in PCA gene expression space. Colors indicate clusters, and point shape indicates whether the point is influential to the fit of the blue line, which is the least squares fit to the scatter plot. The gray shade around the line is the 95% confidence interval of the fit. Contours show the area with the highest point density. The gray dotted lines show the mean on the x and y axes. C) Histograms of local Moran’s I values per cell of top marker genes of each cluster in the PBMC dataset, colored by cell cluster. The y axis (number of cells per bin) is log-transformed for better dynamic range. The histograms are plotted as lines instead of bars to avoid overlapping bars from different clusters. D) Concordex heatmap for the PBMC Leiden clusters. High diagonal and low off-diagonal values indicate high clustering quality, meaning that the Leiden clusters reflect the k-nearest-neighbor graph well; cluster 6 has somewhat lower quality.
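The Moran scatter plot in panel B plots each cell's value against its spatial lag, with neighbors taken from a k-nearest-neighbor graph in PCA space rather than in physical space. A minimal sketch of that idea follows, assuming row-standardized weights (each lag is the mean over a cell's k neighbors). The helper names, the toy 2D embedding, and the toy counts are hypothetical, and this is not Voyager's implementation. Under the row-standardization assumption, the least squares slope of the scatter plot equals Moran's I.

```python
import math


def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point (Euclidean, brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def spatial_lag(values, neighbors):
    """Row-standardized lag: mean of the values of each point's neighbors."""
    return [sum(values[j] for j in nb) / len(nb) for nb in neighbors]


def ols_slope(x, y):
    """Slope of the least squares line of y regressed on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def morans_i_knn(values, neighbors):
    """Moran's I with row-standardized kNN weights (sum of weights equals n)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    num = sum(
        z[i] * sum(z[j] for j in nb) / len(nb)
        for i, nb in enumerate(neighbors)
    )
    den = sum(d * d for d in z)
    return num / den


# Toy stand-in for a PCA embedding: two well-separated groups of cells.
embedding = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11)]
# Toy stand-in for nCounts: low in one group, high in the other.
n_counts = [100, 120, 110, 130, 900, 950, 920, 980]

nbrs = knn_indices(embedding, k=2)
lag = spatial_lag(n_counts, nbrs)
slope = ols_slope(n_counts, lag)
moran = morans_i_knn(n_counts, nbrs)
print(slope, moran)
```

Because cells with similar counts sit close together in the toy embedding, both quantities are positive, and they agree up to floating-point error.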

**Figure 4:**
Comparisons between results obtained by Seurat and scanpy, and between Voyager R and Python for a mouse olfactory bulb Visium dataset. A) Comparison of Visium spot embeddings in the first 2 PCs from Seurat and scanpy with default parameters. The lines connect corresponding spots in Seurat and scanpy. B) As in A, but for Voyager R and VoyagerPy, with parameters stated in this section. C) Cosine distances between the first 20 PCA eigenvectors (gene loadings) from Seurat and scanpy (yellow), and from Voyager R and Python (blue). The dashed line is the magnitude that can be explained by machine double precision. The text part of the line is somewhat smoothed for readability but should not affect interpretation. D) Absolute values of differences in the proportion of variance explained by each of the top 20 PCs. E) Moran’s I from VoyagerPy vs. Voyager R. The blue line is y = x, showing that the results are consistent. F) Same as E but for local Moran’s I for gene S100a5. G) Plotting the local Moran’s I values in space, with the H&E image behind the spots, from Voyager R (top) and VoyagerPy (bottom).
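Panel C compares PCA gene loadings by cosine distance. A small plain-Python sketch of such a comparison is given below. Eigenvectors are only defined up to sign, so this sketch optionally takes the absolute value of the cosine before converting it to a distance; whether the authors did so is not stated in this caption and is an assumption here, as are the function name and the toy vectors. Double-precision machine epsilon is printed only for scale; the caption does not specify how the threshold behind the dashed line was derived.

```python
import math
import sys


def cosine_distance(u, v, ignore_sign=True):
    """1 - cos(angle between u and v); optionally treat v and -v as identical."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    if ignore_sign:
        cos = abs(cos)
    return 1.0 - cos


# Toy loading vectors (hypothetical): b is a with flipped sign, c is orthogonal to a.
loadings_a = [0.6, 0.8, 0.0]
loadings_b = [-0.6, -0.8, 0.0]
loadings_c = [0.0, 0.0, 1.0]

d_flipped = cosine_distance(loadings_a, loadings_b)
d_raw = cosine_distance(loadings_a, loadings_b, ignore_sign=False)
d_orth = cosine_distance(loadings_a, loadings_c)

print("sign-insensitive distance:", d_flipped)
print("raw distance:", d_raw)
print("orthogonal vectors:", d_orth)
print("double-precision epsilon:", sys.float_info.epsilon)
```

Ignoring sign, the flipped vector is at distance approximately 0 from the original, whereas the raw cosine distance is approximately 2, which would make two otherwise identical decompositions look maximally different.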

Grants and funding

UM1 HG012077/HG/NHGRI NIH HHS/United States
