Patterns (N Y). 2022 Feb 8;3(3):100443.
doi: 10.1016/j.patter.2022.100443. eCollection 2022 Mar 11.

EMBEDR: Distinguishing signal from noise in single-cell omics data

Eric M Johnson et al. Patterns (N Y).

Abstract

Single-cell "omics"-based measurements are often high dimensional, so dimensionality reduction (DR) algorithms are necessary for data visualization and analysis. The lack of methods for separating signal from noise in DR outputs has limited their utility in generating data-driven discoveries in single-cell data. In this work we present EMBEDR, which assesses the output of any DR algorithm to distinguish evidence of structure from algorithm-induced noise. We apply EMBEDR to DR-generated representations of single-cell omics data of several modalities to show where they display real, not spurious, structure. EMBEDR generates a p value for each sample, allowing for direct comparisons of DR algorithms and facilitating optimization of algorithm hyperparameters. We show that the scale of a sample's neighborhood can thus be determined and used to generate a novel "cell-wise optimal" embedding. EMBEDR is available as a Python package for immediate use.

Keywords: ATAC-seq; UMAP; cell-type identification; clustering; data visualization; dimensionality reduction; quality assessment; single-cell RNA sequencing; single-cell analysis; t-SNE.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Features of dimensionally reduced data are sensitive to the choice of algorithm and algorithmic settings. (A–D) Four dimensionally reduced representations of RNA-seq measurements from 4,771 bone marrow cells collected by the Tabula Muris Consortium, generated by t-SNE at kEff ≈ 15 (A) and 150 (C) (perplexity = 10 and 120, respectively; custom variation of the openTSNE implementation [51, 52]) and by UMAP at n_neighbors = k = 15 (B) and 150 (D). Ten previously annotated cell types provided by Schaum et al. are colored and labeled. The same cells are colored and labeled in each panel. (B) The number of nearest neighbors, k, is set to its default value, 15, in UMAP. Following the method in supplemental section S3, we use t-SNE with a similar number of nearest neighbors (kEff = 15) in (A). (C and D) We visualize the data using t-SNE and UMAP, respectively, at a much larger number of nearest neighbors: kEff ≈ 150 in (C) and k = 150 in (D).
Figure 2
A schematic of the EMBEDR algorithm. (A) The data (5,037 FACS-sorted marrow cells from Schaum et al., shown as a heatmap) are embedded in 2D using a DR method several times (here: UMAP with k = n_neighbors = 100). For each sample, the distances to neighboring samples are calculated in both the original data (between x_i and x_j) and the low-dimensional embedding (between y_i and y_j). An example cell is illustrated by a red star in each of the embeddings. These distance distributions are compared to calculate EES_{i,n}, a quality score for each cell in each embedding. (B) The same procedure as in (A) is conducted using null datasets constructed via marginal resampling (see Figure 3). A purple star indicates a sample point in each null embedding. (C) The individual EES_{i,n} values are compared with the null distribution of EES to estimate a p value for each cell's embedding quality. This p value corresponds to the empirical likelihood that the null data could generate an observed or better embedding quality. (D) The UMAP embedding of the data from (A) is shown. Cells in this embedding are colored according to the p values calculated in (C), so that embedding quality can easily be visualized across an embedding. The light purple cells are those whose neighborhoods are better preserved than expected by random chance.
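The comparison in panel (C) can be illustrated with a minimal sketch of an empirical p value. This is not the EMBEDR package's code: the function name `empirical_p_value`, the `lower_is_better` flag, and the toy scores are assumptions made for illustration, and the caption does not state whether a smaller or larger EES means a better embedding, so the direction is left as a parameter.

```python
def empirical_p_value(observed, null_scores, lower_is_better=True):
    """Fraction of null scores that are as good as or better than `observed`.

    Assumption: a plain fraction is used here; some empirical p-value
    conventions add 1 to the numerator and denominator instead.
    """
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    if lower_is_better:
        n_as_good = sum(1 for s in null_scores if s <= observed)
    else:
        n_as_good = sum(1 for s in null_scores if s >= observed)
    return n_as_good / len(null_scores)


# Toy example: one observed quality score against five made-up null scores.
null_scores = [0.9, 1.1, 1.3, 0.8, 1.0]
p = empirical_p_value(0.85, null_scores)
print(p)
```

A small p value here means that few null embeddings reach the observed quality, which is the sense in which a cell's neighborhood is "better preserved than expected by random chance".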
Figure 3
An overview of marginal resampling for generating null datasets. (A) Gene expression data for real and resampled scRNA-seq data (FACS-sorted marrow cells [8]) are shown as heatmaps. (B) The first and second principal components of the data in (A) are plotted against each other, and the corresponding marginal distributions are shown to the top and right. Kernel density estimates are also plotted on the marginal distributions. (C and D) The effect of marginal resampling to generate null distributions is shown, where the data and a null dataset are embedded using UMAP at k = 15 and t-SNE at kEff ≈ 60, respectively, which correspond to the default parameters for those algorithms.
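As a rough illustration of marginal resampling, the sketch below draws each feature (column) independently, so each feature's marginal distribution is approximately preserved while the dependence between features is destroyed. The function name `marginal_resample` and the choice to sample with replacement are assumptions; the caption does not specify these details, and this is not taken from the EMBEDR package.

```python
import random


def marginal_resample(data, seed=None):
    """Return a null dataset with the same shape as `data` (a list of rows).

    Each column is resampled independently (assumed: with replacement),
    so joint structure across features is broken.
    """
    rng = random.Random(seed)
    n_rows = len(data)
    columns = list(zip(*data))  # transpose: one tuple per feature
    null_columns = [rng.choices(col, k=n_rows) for col in columns]
    return [list(row) for row in zip(*null_columns)]  # transpose back to rows


# Toy data: three "cells" with two perfectly correlated "genes".
data = [[1, 10], [2, 20], [3, 30]]
null = marginal_resample(data, seed=0)
print(null)
```

In the toy data the two columns are perfectly correlated; after resampling, each column still contains only values from its original column, but rows no longer pair them consistently.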
Figure 4
Optimizing DR algorithm hyperparameters generates high-quality embeddings. A total of 4,771 bone marrow cells from several mice were embedded with t-SNE 5 times at several values of kEff, and the EMBEDR p value was calculated using 10 null embeddings. (A–C) Embeddings generated at three interesting values of kEff; each cell is colored by the EMBEDR p value (shown by the color bar) in (D). In (A), kEff ≈ 40 corresponds to the default t-SNE parameter (perplexity = 30) in most implementations of t-SNE. (C) An embedding generated using kEff ≈ 1,200 (perplexity = 1,000), which corresponds to the largest fraction of cells being well represented in the lower-dimensional embedding. Similarly, (B) shows the results at kEff ≈ 150 (perplexity = 100), which corresponds to a second, smaller minimum in the p values. (D) The distributions of p values are shown as box-and-whisker plots over each value of kEff, and the median of the boxplot at kEff ≈ 1,200 indicates that a substantial fraction of cells are best embedded at that hyperparameter value.
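The sweep summarized in panel (D) can be sketched as follows: given per-cell p values at each kEff, compute the median at each value and pick the kEff with the lowest median. The function name `best_global_k` and the p values below are made up for illustration and are not results from the paper.

```python
from statistics import median


def best_global_k(p_values_by_k):
    """Return (k with the lowest median p value, dict of medians per k)."""
    medians = {k: median(p_values) for k, p_values in p_values_by_k.items()}
    best_k = min(medians, key=medians.get)
    return best_k, medians


# Hypothetical p values for three cells at three values of kEff.
sweep = {
    15: [0.5, 0.4, 0.6],
    150: [0.01, 0.2, 0.05],
    1200: [0.001, 0.3, 0.002],
}
best_k, medians = best_global_k(sweep)
print(best_k, medians)
```

The median is only one possible summary; the fraction of cells below a p value threshold would be an equally reasonable criterion.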
Figure 5
EMBEDR facilitates direct comparisons of DR methods. (A–D) A total of 4,711 cells from the Tabula Muris marrow tissue are embedded by t-SNE and UMAP at default (A and B) and EMBEDR-optimized (C and D) numbers of nearest neighbors. Each cell in each embedding is colored by the EMBEDR p value according to the color bars on the right. The p values are calculated as in Figure 2 and in supplemental section S4 using N_embed = 25 applications of t-SNE/UMAP to the data and 10 embeddings of null data. In the boxes below each panel, the number (percentage) of cells at each p value threshold is shown (indicated by the corresponding color), with the threshold containing a plurality of cells shown in bold.
Figure 6
Different cell types are best embedded at a variety of scales. Using annotations from the Tabula Muris project, the embedding quality of different cell types in the marrow data can be examined individually across values of kEff. (A) Six identified cell types from the bone marrow tissue are shown, where each cell with a given annotation is shown as an individual line. The colored boxes indicate the median p value across all cells with that annotation, and the solid lines indicate the 90th percentiles. Similar plots for all cell types are shown in Figure S13. Embeddings at kEff ≈ 150 and 1,200 are shown in (B and C), respectively. (B and C) The cells corresponding to each cell type are highlighted with the same color as in (A). Cells with an EMBEDR p value below 10^-3 (the gray line in A) are opaque, while other cells with a highlighted annotation are lightly shaded. The fractions of such cells in an annotation are shown in the colored boxes below the embeddings. Other cell types are shown in gray for context.
Figure 7
A cell-wise optimized embedding reveals clear biological signals. Adapting t-SNE to use a different scale for each sample in the Tabula Muris marrow data generates a well-structured representation of the data. (A) The unlabeled embedding is presented. (B and C) To generate this embedding, the scale at which a cell's p value was minimized was used to set kEff for that cell. This kEff is shown in (B) and the minimal p value achieved by a cell across the sweep is shown in (C). (D) Applying DBSCAN with eps set based on the pairwise distance (PWD) distribution of cells in the embedding (specifically, the 1.5th percentile of PWDs) detected the seven indicated clusters. Any Tabula Muris cell-type annotation for which more than 20 cells overlapped with a DBSCAN label was given a different shade of the cluster color. (E) These cell annotations and colors are shown as a confusion table.
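Two steps in this caption lend themselves to a short sketch: choosing, for each cell, the kEff at which its p value is smallest, and deriving a DBSCAN eps from a low percentile of the pairwise distances in the embedding. The helper names, the linear-interpolation percentile, and the toy inputs are assumptions for illustration; the caption does not say how the percentile was computed.

```python
import math
from itertools import combinations


def per_cell_best_k(p_by_cell):
    """For each cell (a dict mapping kEff -> p value), return the kEff with the smallest p."""
    return [min(p_of_k, key=p_of_k.get) for p_of_k in p_by_cell]


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def eps_from_pairwise_distances(points, q=1.5):
    """eps taken as the q-th percentile of all pairwise Euclidean distances."""
    dists = [math.dist(a, b) for a, b in combinations(points, 2)]
    return percentile(dists, q)


# Hypothetical per-cell sweeps and a tiny 2D embedding.
cells = [{15: 0.2, 150: 0.01}, {15: 0.001, 150: 0.3}]
print(per_cell_best_k(cells))
embedding = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(eps_from_pairwise_distances(embedding))
```

The resulting value could then be passed as the `eps` argument of a DBSCAN implementation such as `sklearn.cluster.DBSCAN`. Computing all pairwise distances scales quadratically with the number of cells, so a subsample may be preferable for large embeddings.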
