Nat Commun. 2022 Feb 9;13(1):780.
doi: 10.1038/s41467-022-28431-4.

UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization

April R Kriebel et al. Nat Commun. 2022.

Abstract

Single-cell genomic technologies provide an unprecedented opportunity to define molecular cell types in a data-driven fashion, but present unique data integration challenges. Many analyses require "mosaic integration", including both features shared across datasets and features exclusive to a single experiment. Previous computational integration approaches require that the input matrices share the same number of either genes or cells, and thus can use only shared features. To address this limitation, we derive a nonnegative matrix factorization algorithm for integrating single-cell datasets containing both shared and unshared features. The key advance is incorporating an additional metagene matrix that allows unshared features to inform the factorization. We demonstrate that incorporating unshared features significantly improves integration of single-cell RNA-seq, spatial transcriptomic, SNARE-seq, and cross-species datasets. We have incorporated the UINMF algorithm into the open-source LIGER R package (https://github.com/welch-lab/liger).

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the UINMF algorithm for integrating single-cell datasets with partially overlapping features.
a Schematic representation of the matrix factorization strategy (top) and optimization problem formulation (bottom). The addition of a factor matrix U_i allows for unshared features to be utilized in joint matrix factorization. Each dataset (E_i) is decomposed into shared metagenes (W), dataset-specific metagenes constructed from shared features (V_i), unshared metagenes (U_i), and cell factor loadings (H_i). The incorporation of the U matrix allows unshared features that occur in only one dataset to inform the resulting integration. b UINMF can integrate data types such as scRNA-seq and snATAC-seq using both gene-centric features and intergenic information. c UINMF can integrate targeted spatial transcriptomic data with simultaneous single-cell RNA and chromatin accessibility measurements using both unshared epigenomic information and unshared genes.
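The decomposition described in panel a can be made concrete with a small sketch. The pure-Python function below evaluates one plausible form of the objective: a squared reconstruction error for the shared and unshared feature blocks of each dataset, plus a penalty on the dataset-specific terms. The matrix orientation (features as rows, cells as columns), the penalty weight `lam`, the exact penalty form, the symbol `P` for the unshared-feature block, and all function names are assumptions made for illustration; none of them is taken from the LIGER package.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sq_frobenius(a):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(x * x for row in a for x in row)


def uinmf_objective(E, P, W, V, U, H, lam):
    """Evaluate an assumed UINMF-style objective over all datasets.

    E[i]: shared features x cells      P[i]: unshared features x cells
    W:    shared features x k          V[i]: shared features x k
    U[i]: unshared features x k        H[i]: k x cells
    A dataset without unshared features uses empty lists for P[i] and U[i].
    """
    total = 0.0
    for E_i, P_i, V_i, U_i, H_i in zip(E, P, V, U, H):
        # Shared features are explained by shared plus dataset-specific metagenes.
        shared_residual = sub(E_i, matmul(add(W, V_i), H_i))
        # Unshared features are explained only by the unshared metagenes U_i.
        unshared_residual = sub(P_i, matmul(U_i, H_i))
        # Assumed penalty on the dataset-specific part of the reconstruction.
        penalty = sq_frobenius(matmul(V_i, H_i)) + sq_frobenius(matmul(U_i, H_i))
        total += sq_frobenius(shared_residual) + sq_frobenius(unshared_residual)
        total += lam * penalty
    return total
```

Because U_i shares the cell loadings H_i with W and V_i, any structure present only in the unshared features still shapes H_i, which is how those features can inform the integration.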
Fig. 2. Addition of intergenic peak information improves integration of RNA and ATAC datasets.
a Schematic illustrating how the UINMF algorithm incorporates intergenic peaks when separately integrating the RNA and ATAC measurements from a SNARE-seq dataset. We treat each data type as if it came from an independent source, and perform an integration using regular iNMF and our proposed UINMF method, which incorporates intergenic peaks. b Average alignment and FOSCTTM (Fraction of Samples Closer Than True Match) scores for iNMF, Seurat v3, Harmony, and UINMF. iNMF and UINMF are both initialized 5 different times over ten random seeds, with UINMF including an additional 7,000 intergenic features in the analysis. For nondeterministic algorithms, data are presented as mean values ± SEM. To compare algorithm performance, we used a paired, one-sided Wilcoxon test to compare UINMF's alignment and FOSCTTM scores to those of iNMF (P = 1.953 × 10^−3, P = 1.953 × 10^−3), Seurat (P = 1.953 × 10^−3, P = 0.01855), and Harmony (P = 9.766 × 10^−4, P = 9.766 × 10^−4), with Seurat exhibiting a significantly lower FOSCTTM score. For each algorithm, we compare 10 pairs of data points (n = 20). We factorize and cluster the cells using their RNA transcripts (c) and chromatin accessibility measures (d) separately. After integration, we use the known cell correspondences to separately plot the gene expression (e) and chromatin accessibility datasets (f) from SNARE-seq, colored by the same cell type labels. We assess the contribution of information contained within the intergenic peaks by assessing the alignment (g) and FOSCTTM (h) scores across a range of included peaks, from 0 unshared features (iNMF) to 7,000 unshared intergenic bins, adding 1,000 unshared features at each step. The bold line indicates the median data value, and the boundaries of each box are defined by the first and third data quartiles (25 and 75%, respectively). The upper (lower) whisker extends from the box to the highest (lowest) point within 1.5 times the interquartile range. Outliers beyond the whiskers are plotted as points. We calculate FOSCTTM and alignment scores for ten random seeds for each number of unshared features (n = 10).
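The FOSCTTM score used in panels b and h relies on the known cell correspondences in SNARE-seq. As a minimal sketch of the idea, assuming Euclidean distance in the joint embedding and averaging over both directions, the function below reports, for each cell, the fraction of other cells that lie closer to it than its true match in the other modality. The function name, the choice of distance, and the denominator (n − 1 rather than n) are assumptions; published implementations differ in these details.

```python
import math


def foscttm(x, y):
    """Fraction Of Samples Closer Than the True Match (lower is better).

    x[i] and y[i] are the embeddings of the same cell measured in two
    modalities. Returns a value in [0, 1]; 0 means every cell's true match
    is its nearest neighbor in the other modality.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must hold the same cells, at least two of them")
    total = 0.0
    for i in range(n):
        true_dist = math.dist(x[i], y[i])
        # Cells of the other modality that sit closer to x[i] than its match.
        closer_in_y = sum(math.dist(x[i], y[j]) < true_dist for j in range(n) if j != i)
        # Same question asked from the point of view of y[i].
        closer_in_x = sum(math.dist(x[j], y[i]) < true_dist for j in range(n) if j != i)
        total += (closer_in_y + closer_in_x) / (2 * (n - 1))
    return total / n
```

Under this definition a perfectly aligned embedding scores 0, and a lower score indicates better integration, which is why a lower FOSCTTM is reported as an improvement.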
Fig. 3. Incorporating additional genes improves integration with STARmap data.
a Schematic of UINMF integration of the spatial transcriptomic data with the scRNA-seq data, in which U incorporates unshared genes that are captured in scRNA-seq but not in the targeted STARmap data. b UMAP of STARmap data alone. c UMAP of STARmap and scRNA integration performed with iNMF using only shared genes. d UMAP of UINMF integration, which incorporates both shared and unshared genes. Both (c) and (d) are annotated using the original scRNA-seq labels and clusterings derived from either the iNMF (c) or UINMF (d) algorithm. We compared the adjusted Rand index (e) and purity (f) metrics from UINMF with iNMF (P = 3.895 × 10^−10, P = 3.895 × 10^−10), Seurat v3 (P = 3.895 × 10^−10, P = 3.895 × 10^−10), and Harmony (P = 3.895 × 10^−10, P = 3.895 × 10^−10) for a range of clustering resolutions, using a paired, one-sided Wilcoxon test (n = 100 values for each test, with each sample having five values across 10 resolutions). For nondeterministic algorithms, data are presented as mean values ± SEM.
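The adjusted Rand index (ARI) and purity compare a clustering against reference labels. The sketch below shows the standard textbook definitions of both metrics in pure Python. It assumes purity is computed as the share of cells carrying the most common reference label of their cluster; the function names are hypothetical and the paper's exact purity convention may differ.

```python
from collections import Counter
from math import comb


def purity(clusters, labels):
    """Share of cells whose reference label is the majority label of their cluster."""
    best = {}
    for (cluster, _label), count in Counter(zip(clusters, labels)).items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(labels)


def adjusted_rand_index(clusters, labels):
    """Rand index corrected for chance agreement (1.0 means identical partitions)."""
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pairs of cells grouped together in both partitions, in each one alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(clusters, labels)).values())
    together_clusters = sum(comb(c, 2) for c in Counter(clusters).values())
    together_labels = sum(comb(c, 2) for c in Counter(labels).values())
    expected = together_clusters * together_labels / comb(n, 2)
    maximum = (together_clusters + together_labels) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put every cell in one group.
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

ARI penalizes both over-splitting and over-merging, while purity can be inflated by producing many small clusters, which is one reason to report the two metrics together across a range of clustering resolutions.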
Fig. 4. Incorporating additional genes locates fine cellular subtypes within 3D spatial volume.
a Dot plot showing marker gene expression in STARmap and scRNA datasets for each joint cluster. The datasets have similar marker expression, indicating that they are well aligned. Plots of three-dimensional spatial locations for different classes of cells colored by cell type: (b) all cell types; (c) excitatory neurons; (d) inhibitory neurons; (e) polydendrocytes and oligodendrocytes; and (f) endothelial cells.
Fig. 5. Incorporating additional genes improves integration with osmFISH data.
a Schematic of data matrices from osmFISH and scRNA-seq. The osmFISH dataset measures only 33 genes, while the scRNA-seq dataset has many unshared genes that are incorporated during UINMF integration. b UMAP plot of osmFISH and scRNA integration with iNMF using only shared genes. c UMAP using UINMF to incorporate an additional 2,000 genes. Both (b) and (c) are annotated using the original scRNA-seq labels and either the iNMF (b) or UINMF (c) derived clusterings. d The spatial arrangement of cell types matches the known tissue structure of the cortex. e, f ARI (e) and purity (f) metrics for iNMF (P < 2.2 × 10^−16, P = 7.078 × 10^−8), UINMF, Seurat v3 (P = 2.505 × 10^−9, P < 2.2 × 10^−16), and Harmony (P = 2.982 × 10^−5, P = 1.744 × 10^−15). Statistical significance was evaluated using a paired, one-sided Wilcoxon test (n = 200 values for each test, with each sample having 10 values across 10 resolutions). For nondeterministic algorithms, data are presented as mean values ± SEM.
Fig. 6. Incorporating unshared chromatin and gene features to integrate spatial transcriptomic and multimodal data.
a UMAP for iNMF integration of STARmap spatial transcriptomic data and SNARE-seq RNA data only, annotated by jointly examining the SNARE-seq and STARmap labels. b Schematic of how unshared gene and chromatin accessibility data is incorporated into the integration analysis of STARmap and SNARE-seq using UINMF. c The integration is improved significantly by the inclusion of ATAC gene-centric features in the U matrix, annotated by jointly examining the SNARE-seq and STARmap labels. d, e The original cell type labels of STARmap cells (d) and SNARE-seq cells (e) show clear correspondence after UINMF integration.
Fig. 7. Incorporating unshared chromatin and gene features improves integration of spatial transcriptomic and multimodal data.
We compared UINMF results with those from iNMF (P = 5.934 × 10^−15, P = 1.046 × 10^−11), Harmony (P = 0.002927, P = 2.2 × 10^−16), and Seurat v3 (P = 0.0004184, P = 2.2 × 10^−16) using ARI (a) and purity (b) metrics. Comparisons were evaluated using a paired, one-sided Wilcoxon test (n = 200 values for each test, with each sample having 10 values across 10 resolutions). For nondeterministic algorithms, data are presented as mean values ± SEM. We also confirmed that the spatial arrangement of predicted cell types in both STARmap replicate one (c) and replicate two (d) matches the known organization of the cortex.
Fig. 8. The inclusion of non-orthologous genes improves the integration of cross-species data.
We use UINMF to include both orthologous and non-orthologous genes when integrating the datasets (a), and demonstrate the alignment between the two datasets (b). We also confirmed cell type correspondence by examining only the mouse cells (c) and only the lizard cells (d), both labeled with their published cell labels. To show the advantage of including the non-orthologous genes, we show the difference in ARI (e) and purity (f) scores using the originally published mouse labels, comparing UINMF performance to iNMF (P = 3.626 × 10^−9, P = 6.258 × 10^−4), Seurat (P < 2.2 × 10^−16, P = 3.047 × 10^−11), and Harmony (P < 2.2 × 10^−16, P = 2.815 × 10^−12). We compare algorithm performance using a paired, one-sided Wilcoxon test, where n = 200 ARI (purity) measures. We also confirm a similar trend in the ARI (g) and purity (h) scores using the original lizard labels to assess performance differences between UINMF and iNMF (P = 1.145 × 10^−6, P = 0.07157), Seurat (P < 2.2 × 10^−16, P = 0.8148), and Harmony (P = 1.674 × 10^−5, P < 2.2 × 10^−16). For nondeterministic algorithms, data are presented as mean values ± SEM.
