Nat Biotechnol. 2021 Aug;39(8):1000-1007.
doi: 10.1038/s41587-021-00867-x. Epub 2021 Apr 19.

Iterative single-cell multi-omic integration using online learning

Chao Gao et al. Nat Biotechnol. 2021 Aug.

Abstract

Integrating large single-cell gene expression, chromatin accessibility and DNA methylation datasets requires general and scalable computational approaches. Here we describe online integrative non-negative matrix factorization (iNMF), an algorithm for integrating large, diverse and continually arriving single-cell datasets. Our approach scales to arbitrarily large numbers of cells using fixed memory, iteratively incorporates new datasets as they are generated and allows many users to simultaneously analyze a single copy of a large dataset by streaming it over the internet. Iterative data addition can also be used to map new data to a reference dataset. Comparisons with previous methods indicate that the improvements in efficiency do not sacrifice dataset alignment and cluster preservation performance. We demonstrate the effectiveness of online iNMF by integrating more than 1 million cells on a standard laptop, integrating large single-cell RNA sequencing and spatial transcriptomic datasets, and iteratively constructing a single-cell multi-omic atlas of the mouse motor cortex.
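The fixed-memory claim can be illustrated with a minimal sketch. It is not the authors' implementation: it is a simplified single-dataset online NMF (no dataset-specific metagenes and no penalty term), and the names `minibatches`, `online_nmf`, `A` and `B` are hypothetical. The point it shows is that only one mini-batch of cells and two small running summaries are held in memory, so memory use does not grow with the number of cells.

```python
import numpy as np
from scipy.optimize import nnls


def minibatches(n_cells, batch_size, epochs, seed=0):
    """Yield index arrays; every cell appears exactly once per epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(n_cells)
        for start in range(0, n_cells, batch_size):
            # Sorted indices suit on-disk formats that require ordered reads.
            yield np.sort(order[start:start + batch_size])


def online_nmf(X, k, batch_size=50, epochs=2, seed=0):
    """Simplified single-dataset online NMF: X (cells x genes) ~ H @ W.

    Assumption: this omits the dataset-specific metagenes and the penalty
    used by iNMF; it only illustrates the mini-batch mechanics.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    W = rng.random((k, n_genes))   # metagenes, k x genes
    A = np.zeros((k, k))           # running sum of H.T @ H
    B = np.zeros((k, n_genes))     # running sum of H.T @ X_batch
    for idx in minibatches(n_cells, batch_size, epochs, seed):
        Xb = np.asarray(X[idx])    # only this mini-batch is held in memory
        # Loadings for the batch: one nonnegative least-squares fit per cell.
        Hb = np.array([nnls(W.T, x)[0] for x in Xb])
        A += Hb.T @ Hb
        B += Hb.T @ Xb
        # One pass of row-wise coordinate descent on W, clipped at zero.
        for j in range(k):
            if A[j, j] > 0:
                W[j] = np.maximum(0.0, W[j] + (B[j] - A[j] @ W) / A[j, j])
    return W
```

Because `A` is k x k and `B` is k x genes, their size depends only on the number of factors and genes. `X` can therefore be any array-like object that supports indexing by a sorted index array, such as a NumPy memory map over a file on disk.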

Conflict of interest statement

Competing Interests

A patent application on LIGER has been submitted by The Broad Institute, Inc., and The General Hospital Corporation with J.D.W. listed as an inventor. The remaining authors declare no competing interests.

Figures

Extended Data Fig. 1. Convergence behavior for online iNMF and batch iNMF algorithms on scRNA-seq data from the adult mouse brain, human PBMC and human pancreas.
The online iNMF algorithm exhibits faster convergence and better objective minimization after a fixed amount of training time. The advantage of the online algorithm in convergence speed is more apparent for larger datasets. a-c, Adult mouse brain (n=691,962 cells, 9 individual datasets). d-f, Human PBMCs (n=13,999 cells, 2 individual datasets). g-i, Human pancreas (n=14,890 cells, 8 individual datasets). Center lines of box plots show the median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; and points are outliers.
Extended Data Fig. 2. Online and batch iNMF yield highly similar UMAP visualizations.
We performed online iNMF and batch iNMF on data from mouse cortex (n=255,353 cells), human PBMC (n=13,999 cells), and human pancreas (n=14,890 cells). Online iNMF and batch iNMF produce very similar visualizations, suggesting that the approaches give very similar dataset alignment and cluster preservation. We subsequently confirmed this qualitative observation using quantitative metrics.
Extended Data Fig. 3. Benchmarking integration across data modalities (RNA+ATAC).
5,000 cells from the snRNA-seq dataset and 5,000 cells from the snATAC-seq dataset in the MOp data collection were integrated using four different methods. The cells are shown in two-dimensional UMAP space and colored by dataset.
Extended Data Fig. 4. Performing online iNMF in three scenarios produces similar results.
These analyses were carried out separately to integrate 8 MOp datasets (scRNA-seq, snRNA-seq, snATAC-seq and snmC-seq, n=408,885) using online iNMF in scenario 1 (a), scenario 2 (b), and scenario 3 (c). The results are visualized in UMAP coordinates and the cells are colored by the cell type annotations from Fig. 6.
Figure 1. Overview of the online iNMF algorithm.
a, Schematic of integrative nonnegative matrix factorization (iNMF): the input single-cell datasets are jointly decomposed into shared (W) and dataset-specific (Vi) metagenes and corresponding “metagene expression levels” or cell factor loadings (Hi). These metagenes and cell factor loadings provide a quantitative definition of cell identity and how it varies across biological settings. b-d, Three different scenarios in which online learning can be used for single-cell data integration. (b) Scenario 1: the single-cell datasets are large but fully observed. Online iNMF processes the data in random mini-batches, enabling memory usage and/or disk storage independent of dataset size. Each cell may be used multiple times in different epochs of training to update the metagenes. (c) Scenario 2: the datasets arrive sequentially, and online iNMF processes the datasets as they arrive, using each cell to update the metagenes exactly once. (d) Scenario 3: online iNMF is performed as in scenario 1 or scenario 2 to learn W and Vi. Then cell factor loadings for the newly arriving dataset are calculated using the shared metagenes (W) learned from previously processed datasets. The new dataset is not used to update the metagenes.
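The decomposition in panel a can be written as an optimization problem. The form below follows the iNMF objective as published for LIGER, which this caption does not restate; the orientation (cells as rows of X_i), the number of datasets N and the penalty weight \lambda are assumptions here, not details taken from the caption.

```latex
\min_{W,\, V_i,\, H_i \,\ge\, 0} \;
\sum_{i=1}^{N} \left\lVert X_i - H_i \,(W + V_i) \right\rVert_F^2
\;+\; \lambda \sum_{i=1}^{N} \left\lVert H_i V_i \right\rVert_F^2
```

Here X_i is the cells-by-genes matrix for dataset i, W holds the shared metagenes, V_i the dataset-specific metagenes, and H_i the cell factor loadings. The second term discourages the dataset-specific component from absorbing signal that the shared metagenes could explain.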
Figure 2. Online iNMF converges much faster than previously published batch algorithms.
a,b, The online iNMF algorithm converges much more rapidly to a similar or better objective function value than the previously published batch methods, alternating nonnegative least squares (ANLS) and multiplicative updates (Mult), on both training and testing sets. c, Box plots comparing the objective function values achieved by applying online and batch iNMF algorithms on the mouse cortex data (n=255,353) after a fixed amount of training time. Center line shows the median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; and points are outliers. d-e, The convergence behavior of online iNMF is nearly identical for mini-batch sizes from 1,000 to 10,000. f, The online iNMF algorithm becomes increasingly efficient (in terms of decrease in objective function value per unit time) as dataset size increases. The time required for the algorithm to converge does not significantly increase with growing dataset size once the dataset size exceeds 50,000 cells.
Figure 3. Benchmark of online iNMF, batch iNMF, Harmony, and Seurat.
The data are sampled from the adult mouse cortex (n=10,000, 50,000, 100,000, 200,000, 255,353 cells, 2 individual datasets), human PBMC (n=13,999 cells, 2 individual datasets) and human pancreas (n=14,890 cells, 8 individual datasets). a, The runtime and peak memory usage required for online iNMF, batch iNMF, Harmony and Seurat to integrate the frontal and posterior cortex datasets. b,c, Quantitative assessment of data integration and low-dimensional embedding carried out by four methods on the human PBMC and human pancreas datasets. Higher values are better for all 4 metrics. Error bars indicate standard deviation across 100 random initializations. The results from iNMF approaches (100 initializations each) are presented as mean values ± standard deviation, while Harmony and Seurat were only run once.
Figure 4. Joint analysis of nine regions of the adult mouse brain (n=691,962 cells) using online iNMF.
a, UMAP visualization of the iNMF factors learned for each brain region, colored by published cell class. b, Dot plot showing the proportion of each of 40 clusters inferred from iNMF in each brain region. c, Proportion of cells from each cluster in every cell type. The cells in each cluster mostly correspond to a single cell type.
Figure 5. Online iNMF integrates large single-cell RNA-seq and spatial transcriptomic datasets.
a, The number of cells per cell type in scRNA-seq (n=193,155 cells) and Slide-seq (n=59,858 beads) datasets from mouse hippocampus. b, Number of cell types assigned to each bead in the Slide-seq analysis. c, Slide-seq beads colored by labels derived from projection onto scRNA-seq data using online iNMF (scenario 3). The coordinates of each bead reflect its spatial position within the tissue. d, UMAP plot of cell factor loadings (online iNMF, scenario 1) for scRNA-seq data from mouse hippocampus. e, UMAP plot of MERFISH cells from mouse hypothalamus (n=1,026,840 cells), colored by published cluster assignments. The UMAP coordinates are derived from online iNMF (scenario 3) integration of MERFISH and scRNA-seq data. f, UMAP plot of scRNA-seq cells from mouse hypothalamus (n=31,250 cells), colored by published cluster assignments. The UMAP coordinates are derived from online iNMF (scenario 3) integration of MERFISH and scRNA-seq. g, MERFISH slices, ordered from anterior to posterior, colored by labels derived from the online iNMF integration. The coordinates of each cell reflect its spatial position within the tissue.
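The scenario 3 projection used in panels c and e-g can be sketched as a nonnegative least-squares fit of each new cell against fixed shared metagenes. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `project_onto_metagenes` is hypothetical, W is assumed to be a factors-by-genes matrix that was already learned, and the new data are assumed to use the same genes in the same order.

```python
import numpy as np
from scipy.optimize import nnls


def project_onto_metagenes(X_new, W):
    """Compute loadings H >= 0 minimizing ||X_new - H @ W|| with W held fixed.

    X_new: new cells x genes (nonnegative); W: factors x genes.
    W is not modified, so the reference metagenes stay unchanged.
    """
    H = np.zeros((X_new.shape[0], W.shape[0]))
    for c, x in enumerate(X_new):
        # nnls solves min ||A h - b|| subject to h >= 0, with A = W.T, b = x.
        H[c], _ = nnls(W.T, x)
    return H


# Usage on synthetic data (hypothetical sizes: 3 factors, 20 genes, 5 cells).
rng = np.random.default_rng(0)
W = rng.random((3, 20))
H_true = rng.random((5, 3))
H_new = project_onto_metagenes(H_true @ W, W)
```

Each cell is fitted independently, so new data can be processed in chunks of any size. The resulting loadings live in the same factor space as the reference cells, which is what allows labels or UMAP coordinates to be shared between the two.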
Figure 6. Iterative refinement of cell identity using multiple single-cell modalities from the mouse primary motor cortex.
We integrated four scRNA-seq datasets, two snRNA-seq datasets, one snATAC-seq dataset and one snmC-seq dataset (n=408,885 neurons). a, Sequential integration of six scRNA-seq datasets (scenario 2). Each panel shows a UMAP plot using cell factors obtained after adding an additional dataset. b, UMAP plot of cell factors obtained by adding snATAC-seq to the latent space learned from six RNA datasets in a (scenario 2). c, UMAP plot of cell factors obtained by adding DNA methylation data (snmC-seq, abbreviated “MET”) to the latent space learned from the seven datasets shown in b (scenario 2). d, Clusters obtained using the cell factor loadings of all eight aligned datasets. The clusters were named using marker genes from Tasic et al.

