2022 Aug 13;24(8):1116.
doi: 10.3390/e24081116.

Multiscale Methods for Signal Selection in Single-Cell Data

Renee S Hoekzema et al. Entropy (Basel).

Abstract

Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may not be detected. We propose three topologically motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores (eig_i) rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at these respective scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing the separation of genes with different roles in a bifurcation process (e.g., pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional biologically meaningful genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and visualisation of relationships between them.

Keywords: feature selection; multiscale data analysis; persistent Laplacian; single cell transcriptomics; topological signal processing.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure A1
Seurat clusters on PBMC data [34] from Seurat VST Vignette [36], numbered according to the vignette, with interpretations of overarching cell types inferred from previous results. The cells in this data set divide into broad clusters corresponding to the cell types found in peripheral blood mononuclear cells: lymphocytes (T cells, NK cells, B cells), monocytes, and dendritic cells, as well as platelets, which are not mononuclear but are found in this specific data set. The DGE analysis from Seurat (non-parametric Wilcoxon rank sum test [41]) defines twelve smaller clusters, in particular subclustering T cells, NK cells and monocytes, and searches only for differentially expressed genes on these subclusters.
Figure A2
Eigenscore ranks for PBMC data [34]. On the top row are plots of the Laplacian eigenvectors, coloured by sign (red positive, purple negative). For each eigenvector e_i, genes are listed with the highest alignment (eig_i+) and highest anti-alignment (eig_i−) with e_i. Below the table is a selection of genes ranked highly by eigenscores. For example, gene FTL, shown below the table, is strongly expressed on the monocyte cluster on the right, which is purple (negative) for both e1 and e2; hence FTL has high scores on eig1− and eig2−.
Figure A3
Expression of relevant marker genes in the T cell data set.
Figure A4
A weighted graph constructed from mouse foetal liver cells sampled from days 10–17 during development. Parent cell type hepatoblasts differentiate into two daughter cell types, cholangiocytes and hepatocytes.
Figure 1
The eigenscore method (defined in Section 2.2), demonstrated here on a graph constructed by taking 100 random points from each of four touching balls in 30 dimensions and connecting them via a 15-nearest-neighbour graph. (A) Laplacian eigenvectors e1 and e2 distinguish the left and right two clusters and the top and bottom two clusters, respectively. (B) Different graph signals align or anti-align differently with the two eigenvectors, resulting in a plot of eigenscore (eig1, eig2)-space that differentiates the various signals. A random signal plots near the origin.
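A minimal Python sketch of the construction in Figure 1 follows. It is not the authors' implementation: the eigenscore is defined in Section 2.2, which is not reproduced here, so the sketch assumes that an eigenscore is the inner product of a unit-normalised signal with a low-frequency Laplacian eigenvector. Four Gaussian blobs in the plane stand in for the four touching balls in 30 dimensions, and the helper names (knn_adjacency, low_frequency_eigenvectors, eigenscores) are hypothetical.

```python
import numpy as np


def knn_adjacency(points, k):
    """Symmetric, unweighted k-nearest-neighbour adjacency matrix."""
    n = len(points)
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)  # keep an edge if either endpoint chose it


def low_frequency_eigenvectors(A, m):
    """First m non-trivial eigenpairs of the combinatorial Laplacian D - A.

    Assumes the graph is connected, so that only the first eigenvector
    (eigenvalue 0) is constant and can be skipped.
    """
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return vals[1:m + 1], vecs[:, 1:m + 1]


def eigenscores(signal, E):
    """Assumed eigenscore: inner product of the unit-normalised signal
    with each column of E (positive = aligned, negative = anti-aligned)."""
    signal = np.asarray(signal, dtype=float)
    norm = np.linalg.norm(signal)
    if norm == 0.0:
        return np.zeros(E.shape[1])
    return E.T @ (signal / norm)


rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]])
points = np.vstack([c + rng.normal(size=(100, 2)) for c in centres])
labels = np.repeat(np.arange(4), 100)

A = knn_adjacency(points, k=15)
eigvals, E = low_frequency_eigenvectors(A, m=2)

cluster_signal = (labels == 0).astype(float)  # 1 on one blob, 0 elsewhere
random_signal = rng.random(len(points))       # no graph structure

print("cluster signal:", eigenscores(cluster_signal, E))
print("random signal: ", eigenscores(random_signal, E))
```

Plotting the two scores of each signal against one another gives a picture in the spirit of panel (B). Under this assumed definition every score lies in [-1, 1] by the Cauchy-Schwarz inequality.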
Figure 2
The graph on the left displays community structures at four different scales, exemplified by the groups A, B, C and D. When computing the mean pairwise variation of information (right) as a function of scale (Markov time), we find local minima corresponding to resolutions A (256 communities), B (64 communities), C (16 communities) and D (4 communities). Figure inspired by [32].
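The variation of information (VI) between two partitions X and Y of the same node set is H(X|Y) + H(Y|X), which is zero exactly when the partitions agree up to relabelling. The sketch below computes it, and its mean over all pairs in a collection of partitions, using only the Python standard library. How the partitions are produced at each Markov time is outside the scope of the sketch, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A) in nats, for two labelings of the same nodes."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


def mean_pairwise_vi(partitions):
    """Average VI over all unordered pairs of partitions."""
    pairs = list(combinations(partitions, 2))
    if not pairs:
        return 0.0
    return sum(variation_of_information(a, b) for a, b in pairs) / len(pairs)


# Three runs that agree up to relabelling: a stable community structure.
stable_runs = [[0, 0, 1, 1], [1, 1, 0, 0], [5, 5, 7, 7]]
# Two runs that split the nodes in unrelated ways: an unstable one.
unstable_runs = [[0, 0, 1, 1], [0, 1, 0, 1]]

print(mean_pairwise_vi(stable_runs))
print(mean_pairwise_vi(unstable_runs))
```

A scale at which repeated community detection runs return nearly the same partition gives a mean pairwise VI close to zero, which is why local minima of this curve are read as resolutions of interest.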
Figure 3
We construct a graph with three communities, all of different sizes. (A) The VI (on y-axis, VI is 0 except for a brief spike around t=3.35) identifies resolutions t1, at which all three communities are identified, and t2, at which two communities are identified (note that due to the simplicity of the graph, there are intervals of local minima instead of points; we pick t1 before the spike and t2 after). In (B), we calculate the MLS at t1 and t2 (given by black circles) of three signals that are equal to 1 on one of the t1-communities (constant part of the signal is highlighted by arrows) and uniformly random elsewhere, and one completely random signal. The signal that is constant on the largest cluster (bottom left) is identified as highly consistent at both times. The random signal (top right) is identified as inconsistent at both times. Conversely, the signal constant on the smallest community (top left) has a high MLS at t2 relative to the MLS at t1, separating it from the signal constant on the community of intermediate size (centre).
Figure 4
The persistent Rayleigh quotient for cell differentiation. (A) (left) Signals (genes) on the graph that we aim to differentiate. (right) The model for the bifurcating differentiation process. (B) The effects on the graph and graph Laplacian after applying the Kron reduction process to the daughter cells. (C) The normalised Rayleigh quotients of (x-axis) full Laplacian Lt1t0t1t0 and (y-axis) persistent Laplacian L0t1t0 for binary functions on the graph representing high and low gene expression of a particular gene. The persistent Rayleigh quotient separates these genes based on relevance to the bifurcation: g1 is expressed in all cell types, g2 is expressed in the parent and one daughter cell type, g3 is expressed only in both daughter cell types, g4 is expressed only in one daughter cell type.
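The two ingredients named in panels (B) and (C) can be illustrated on a toy graph. Kron reduction eliminates a set of nodes from a graph Laplacian by taking a Schur complement, and the Rayleigh quotient of a signal f with respect to a Laplacian L is f^T L f / f^T f. The sketch below is an illustration only: the choice of which nodes to eliminate, the plain (unnormalised) Rayleigh quotient, and the function names are assumptions, not the paper's definitions.

```python
import numpy as np


def kron_reduce(L, keep):
    """Schur complement of a symmetric Laplacian L onto the nodes in `keep`.

    The remaining nodes are eliminated; their connectivity is absorbed
    into effective edge weights between the kept nodes.
    """
    L = np.asarray(L, dtype=float)
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(L.shape[0]), keep)
    L_kk = L[np.ix_(keep, keep)]
    L_kd = L[np.ix_(keep, drop)]
    L_dd = L[np.ix_(drop, drop)]
    # L_kk - L_kd L_dd^{-1} L_dk, with L_dk = L_kd^T because L is symmetric
    return L_kk - L_kd @ np.linalg.solve(L_dd, L_kd.T)


def rayleigh_quotient(L, f):
    """f^T L f / f^T f for a non-zero signal f."""
    f = np.asarray(f, dtype=float)
    return float(f @ L @ f) / float(f @ f)


# Path graph 0 - 1 - 2 with unit edge weights.
L_path = np.array([[1.0, -1.0, 0.0],
                   [-1.0, 2.0, -1.0],
                   [0.0, -1.0, 1.0]])

# Eliminate the middle node: two unit edges in series act like one edge of weight 1/2.
L_red = kron_reduce(L_path, keep=[0, 2])

print(L_red)
print(rayleigh_quotient(L_path, [1.0, 0.0, 0.0]))  # signal on the full graph
print(rayleigh_quotient(L_red, [1.0, 0.0]))        # same signal on the reduced graph
```

Comparing the quotient of a binary signal before and after the reduction is the kind of contrast that panel (C) plots on its two axes.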
Figure 5
Lambrechts et al. [38] classified T cells into six sub-cell types based on marker genes. To reduce overplotting and assist visualisation, points with non-zero expression were plotted on top for this figure and Figure A3.
Figure 6
Geometry of cell space and gene space. (A) Cell types in PBMC data [34]. (B) UMAP of genes set in eigenscore space for eigenvectors 1–16. Genes (dots) are colour-coded for the logarithm of the norm of the vector in 16-dimensional eigenscore space. Genes with similar expression patterns in the PBMC single-cell data [34] plot close together in eigenscore space, and expression patterns vary continuously as we move through this space. The outward branches I–VI correspond to genes that are expressed highly on specific groups of cells.
Figure 7
Eigenscores compared to differential gene expression (DGE) on PBMC data set [34]. (A) Comparative study of DGE ranking using Seurat clustering and a non-parametric Wilcoxon rank sum test (log of rank computed from adjusted p-value on x-axis) versus ranking by norm in eigenscore space (log of eigenscore rank of 16 lowest frequencies on y-axis). Example genes in the top 100 for one ranking but not the other are shown on the sides. (B) Top 100 genes ranked by adjusted p-value in DGE marked on the eigenscore UMAP plot of genes from Figure 6. Two regions in the UMAP not found in the top of DGE are branch V from Figure 6B (T cell and lymphocyte genes that are expressed in larger groups of cells) and branch VI (genes expressed in the RRM2+ cluster, which is not found by DGE). (C) Quantitative comparison of gene ranks given by adjusted p-value in DGE versus norm in 16-dimensional eigenscore space.
Figure 8
(A) UMAP of genes from T cell data set [38] in eigenscore space for eigenvectors 1–19, colour-coded for the logarithm of norm of the vector in 19-dimensional eigenscore space. Genes with similar expression group together and reveal substructure in the data set. Some genes have unique expression patterns not matched by other genes. Boxed genes represent a group of genes with similar expression whereas unboxed genes represent isolated gene behaviour. (B) Top 20 genes ranked by norm in 1–19 dimensional eigenscore space.
Figure 9
Multiscale Laplacian scores of PBMC data set [34]. (A) The graph of variation of information of community structures returned by 100 iterations of the Louvain algorithm at each Markov time. Local minima indicate stable community structures and, hence, scales of interest. The community structures at three such minima are shown by colourings of UMAP plots. (B) Left: three scatter plots comparing the multiscale Laplacian scores of genes (grey dots) at successive times to one another (upper two) and of t3 to the combinatorial Laplacian score (in all plots, axes are truncated). We highlight 6 genes of interest (annotated). Middle and Right: UMAP plots visualising the gene expression of six genes selected based on their MLS.
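For orientation, the sketch below computes a single-scale Laplacian score in the standard form introduced by He, Cai and Niyogi for feature selection: the signal is centred with respect to node degree, and its variation across edges is divided by its degree-weighted variance, so lower scores mean smoother signals. That this matches the "combinatorial Laplacian score" of the caption is an assumption, and the multiscale variant, whose definition is not given here, is not reproduced. The function name and toy graph are illustrative.

```python
import numpy as np


def laplacian_score(A, f):
    """Laplacian score of signal f on a graph with adjacency matrix A.

    score = f~^T L f~ / f~^T D f~, where f~ is f minus its degree-weighted
    mean. Lower values indicate a signal that varies little across edges.
    """
    A = np.asarray(A, dtype=float)
    f = np.asarray(f, dtype=float)
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    f_centred = f - (f @ deg) / deg.sum()
    denom = f_centred @ (deg * f_centred)
    if denom == 0.0:
        raise ValueError("signal is constant; the score is undefined")
    return float(f_centred @ L @ f_centred / denom)


# Two triangles (nodes 0-2 and 3-5) joined by the bridge edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

block_signal = [1, 1, 1, 0, 0, 0]        # constant on each triangle
alternating_signal = [1, 0, 1, 0, 1, 0]  # ignores the community structure

print(laplacian_score(A, block_signal))
print(laplacian_score(A, alternating_signal))
```

The signal that respects the two communities receives the lower score, which is the sense in which such scores favour genes that are coherently expressed on the graph.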
Figure 10
Multiscale Laplacian score of human T cell data set [38]. (A) The graph of variation of information of community structures. Again, local minima indicate scales of interest. Community structures at three scales are picked out. (B) (Left): three scatter plots comparing the multiscale Laplacian scores of genes (grey dots) at successive times to one another (left and middle plot) and of t3 to the combinatorial Laplacian score (in all plots, axes are truncated). We highlight 6 genes of interest (black dots; annotated). (Middle and Right): UMAP plots visualising the gene expression of six genes selected based on their MLS.
Figure 11
The persistent Rayleigh quotient separates genes by their role in a cell differentiation process. The PRQ is parameterised by birth (i) and death (j), each pair (i,j) assigning a non-negative number to every gene. We plot these values for each gene for (i=7,j=7) on the x-axis and (i=2,j=7) on the y-axis in subfigure (C). Selected for display (A,B,D,E) are top differentially expressed genes from [42] on the data from [39] (see Figure A4). Genes Tubb5, Mdk, and Igfbp1 are expressed in the parent and one daughter cell lineage, hepatoblast to (A) cholangiocyte or (B) hepatocyte, and lie above the diagonal. Genes Aldob and Mt2 are expressed in both daughter cell types but not in the parent cell type (D), and they lie below the diagonal. Genes Ahsg and Fabp1 are only expressed in one daughter cell type (E) and lie on the diagonal (compare with Figure 4).

References

    1. Hao Y., Hao S., Andersen-Nissen E., Mauck W.M., III, Zheng S., Butler A., Lee M.J., Wilk A.J., Darby C., Zagar M., et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573–3587. doi: 10.1016/j.cell.2021.04.048.
    2. Wolf F.A., Angerer P., Theis F.J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:1–5. doi: 10.1186/s13059-017-1382-0.
    3. McInnes L., Healy J., Saul N., Großberger L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 2018;3:861. doi: 10.21105/joss.00861.
    4. Becht E., McInnes L., Healy J., Dutertre C.A., Kwok I.W., Ng L.G., Ginhoux F., Newell E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2019;37:38–44. doi: 10.1038/nbt.4314.
    5. Jeitziner R., Carrière M., Rougemont J., Oudot S., Hess K., Brisken C. Two-tier mapper: A user-independent clustering method for global gene expression analysis based on topology. arXiv. 2017. 1801.01841.
