Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Nov;15(11):3632-3662.
doi: 10.1038/s41596-020-0391-8. Epub 2020 Oct 12.

Jointly defining cell types from multiple single-cell datasets using LIGER

Affiliations

Jointly defining cell types from multiple single-cell datasets using LIGER

Jialin Liu et al. Nat Protoc. 2020 Nov.

Abstract

High-throughput single-cell sequencing technologies hold tremendous potential for defining cell types in an unbiased fashion using gene expression and epigenomic state. A key challenge in realizing this potential is integrating single-cell datasets from multiple protocols, biological contexts, and data modalities into a joint definition of cellular identity. We previously developed an approach, called linked inference of genomic experimental relationships (LIGER), that uses integrative nonnegative matrix factorization to address this challenge. Here, we provide a step-by-step protocol for using LIGER to jointly define cell types from multiple single-cell datasets. The main stages of the protocol are data preprocessing and normalization, joint factorization, quantile normalization and joint clustering, and visualization. We describe how to jointly define cell types from single-cell RNA-seq (scRNA-seq) and single-nucleus ATAC-seq (snATAC-seq) data, but similar steps apply across a wide range of other settings and data types, including cross-species analysis, single-nucleus DNA methylation, and spatial transcriptomics. Our protocol contains examples of expected results, describes common pitfalls, and relies only on our freely available, open-source R implementation of LIGER. We also provide R Markdown tutorials showing the outputs from each individual code segment. The analysis process can be performed in 1-4 h, depending on dataset size, and assumes no specialized bioinformatics training.

PubMed Disclaimer

Conflict of interest statement

Competing interests

A patent application on LIGER has been submitted by The Broad Institute, Inc. and The General Hospital Corporation with EZM, JDW and VK as inventors.

Figures

Figure 1:
Figure 1:. Diagram of high-level protocol steps.
The diagram shows the four main stages of the LIGER workflow: preprocessing and normalization; joint matrix factorization; quantile normalization and joint clustering; and visualization and downstream analysis. Note that the datasets, colors and diagrams are merely generic schematics, and are not derived from the example datasets analyzed in subsequent figures.
Figure 2:
Figure 2:. Visualizing LIGER results using UMAP and t-SNE.
a,b, UMAP plots of a LIGER analysis of 3000 control and 3000 interferon-beta stimulated PBMCs profiled by scRNA-seq, colored by dataset (a) and LIGER joint cluster assignment (b). c,d, t-SNE plots of the same LIGER analysis, colored by dataset (c) and LIGER joint cluster assignment (d). The parameters k and λ were specified as 20 and 5, as shown in the protocol.
Figure 3:
Figure 3:. LIGER enables metagene and dataset specific analysis of PBMC data.
UMAP plots showing factor loading values for each cell (top) and gene loadings (on a particular metagene; bottom) for Factor 4, which specifically loads on Cluster 3. In gene loading plots, gene names are sorted in decreasing order of magnitude of their factor loading contribution and correspond to colored points in scatterplots. The gene names in purple represent the top-loading genes. Plots are organized to show the dataset-specific metagene values for control cells, the shared metagene values common to all datasets and the dataset-specific metagene values for interferon-stimulated cells.
Figure 4:
Figure 4:. Parameter selection of the number of factors k and the tuning parameter λ.
a,b, As increases in λ (a) and k (b) result in smaller relative increases in metrics of their effectiveness, the “elbow” of the graph can be interpreted as the optimal hyperparameter value.
Figure 5:
Figure 5:. Plots of raw and normalized loading of Factor 21.
Scatter plots, with factor loadings values as y axis and cells as x axis, for both unaligned (raw) and aligned (normalized) factor loadings of Factor 21. The datasets aligned are interneurons (3212 cells) and oligodendrocytes (2524 cells) from the mouse frontal cortex.
Figure 6:
Figure 6:. Spurious alignment between datasets decreases after removing mitochondrial artifact factors.
a,b, UMAP plots of a LIGER analysis of 3212 interneurons and 2524 oligodendrocytes, with (a) and without (b) factors 21 and 23, colored by datasets.
Figure 7:
Figure 7:. Distinct cell types show poor alignment compared to alignment of control and stimulated PBMC datasets.
a, UMAP plot of a LIGER analysis of two distinct cell types, interneurons (3212 cells) and oligodendrocytes (2524 cells), showing poor alignment. b, UMAP plot of a LIGER analysis of 3000 control and 3000 interferon-stimulated PBMCs showing well-mixed alignments.
Figure 8:
Figure 8:. Diagram of differential expression analysis strategies to find shared cluster markers and cluster-specific dataset differences.
This diagram shows the strategy used by the function runWilcoxon to construct two groups of cells (shown on the left and right sides of the dotted lines), which are then used to calculate p-values for differential gene expression according to the Wilcoxon rank-sum test. The diagram highlights the differences when setting parameter compare.method to “clusters” (top) and “datasets” (bottom). The arrows indicate groups of cells that are compared in each case. When the parameter compare.method is set to “clusters”, the runWilcoxon function will compare each joint cluster (ovals in three different colors, with labels “Cluster 1” to “Cluster 3”) to all other joint clusters using cells from all datasets (both red and blue cells within each oval). The genes with significant p-values in this case represent cluster markers that are shared across datasets. When the parameter compare.method is set to “datasets”, the runWilcoxon function will compare cells from each dataset (e.g., red vs. blue) within the same joint cluster. Genes with significant p-values in this case represent cluster-specific expression differences across datasets. Note that, in both cases, normalized expression counts are used for the Wilcoxon rank-sum test--the original counts are not “batch corrected” before performing the test, because doing so would remove all dataset differences, which are often biologically significant. Because the cluster labels are obtained from the LIGER joint latent space and thus represent corresponding cell populations across datasets, such comparisons on the original normalized counts are meaningful.
Figure 9:
Figure 9:. Marker gene identified by LIGER shows consistent cell-type-specific expression across datasets.
a,b, UMAP representations of expression for gene PRF1, a marker gene of cluster 3, in control (a) and interferon-beta stimulated (b) PBMCs exhibit similar distributions.
Figure 10:
Figure 10:. Marker genes identified by LIGER show expression differences across datasets.
a,b, UMAP representations of expression for gene IFIT3, a marker gene of the interferon-stimulated dataset, shows low expression in control (a) and high expression in IFNB-stimulated (b) PBMCs. c,d, UMAP representations of expression for gene IFITM3, a marker gene of stimulation in clusters 1, 2, 5 and 7, in control (c) and IFNB-stimulated (d) PBMCs similarly shows more expression in interferon-stimulated cells.
Figure 11:
Figure 11:. LIGER enables joint clustering of BMMC data across modalities.
a,b, UMAP plots of a LIGER analysis of 12,602 scRNA-seq profiles and 6,234 nuclei profiled by snATAC-seq, showing 16 joint clusters, colored by technology (a) and LIGER cluster assignment (b). The parameters k and λ were specified as 20 and 5, as shown in the protocol.
Figure 12:
Figure 12:. Expression and chromatin accessibility of marker genes selected by LIGER show consistency across modalities.
a,b, UMAP representations of expression for genes S100A9 (a) and MS4A1 (b). c,d, UMAP representations of chromatin accessibility for genes S100A9 (c) and MS4A1 (d), which show highly similar chromatin accessibility distributions compared to their expression (a, b).
Figure 13:
Figure 13:. Metagenes and metagene expression levels for BMMC data.
UMAP plots showing metagene expression levels (top) and gene loading values (bottom) for Factor 7. The top panel shows that cells in Cluster 4 show high values of Factor 7, indicating that Factor 7 specifically defines Cluster 4. In gene loading plots, gene names are sorted in decreasing order of magnitude of their factor loading contribution and correspond to colored points in scatterplots. Plots are organized to show the metagene specific to snATAC-seq (left), the shared metagene common to all datasets (middle) and the metagene specific to scRNA-seq profiles (right).
Figure 14:
Figure 14:. Genes showing expression and accessibility differences.
a,b, UMAP representation of expression for CCR6 (a) and NCF1 (b). c,d, UMAP representation of chromatin accessibility of CCR6 (c) and NCF1 (d), which both show distinct chromatin accessibility distributions compared to their expression.
Figure 15:
Figure 15:
UCSC genome browser view showing the correlations between three candidate chromatin accessible regions and target gene S100A9. The locations of three peaks are shown as short black strips within the row “Regions”, and the correlations are illustrated by dotted arcs. H3K27 acetylation and DNasel hypersensitivity across ENCODE cell lines are also shown at the bottom.
Figure 16:
Figure 16:. Expression and correlated accessibility for S100A9 and nearby intergenic peak.
a, UMAP representation of imputed chromatin accessibility of gene S100A9. b, UMAP representation of chromatin accessibility for peak chr1:153358896–153359396.
Figure 17:
Figure 17:. Runtime and peak memory usage for joint factorization of scRNA-seq datasets using LIGER.
We benchmarked LIGER runtime and memory usage using frontal and posterior cortex datasets from the mouse brain.The number of cells range from 10,000 to 100,000. A total of 1109 highly variable genes were selected. The parameters k and λ were specified as 40 and 5. a,b, The runtime (a) and peak memory usage (b) were recorded for the optimizeALS function. KL_div, median KL divergence.

References

    1. Saunders A et al. Molecular Diversity and Specializations among the Cells of the Adult Mouse Brain. Cell 174, 1015–1030.e16 (2018). - PMC - PubMed
    1. Welch JD et al. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell 177, 1873–1887.e17 (2019). - PMC - PubMed
    1. Yang Z & Michailidis G A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics 32, 1–8 (2016). - PMC - PubMed
    1. Zhang AW et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019). - PMC - PubMed
    1. Cao J et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018). - PMC - PubMed

Related links

Key references using this protocol
    1. Welch JD et al. Cell 177, 1873–1887.e17 (2019): 10.1016/j.cell.2019.05.006 - DOI - PMC - PubMed
    1. Yao Z et al. An integrated transcriptomic and epigenomic atlas of mouse primary motor cortex cell types. bioRxiv (2020): 10.1101/2020.02.29.970558 - DOI
Key data used in this protocol
    1. Saunders A et al. Cell 174, 1015–1030.e16 (2018): 10.1016/j.cell.2018.07.028 - DOI - PMC - PubMed
    1. Kang H et al. Nat. Biotechnol 36, 89–94 (2018): 10.1038/nbt.4042 - DOI - PMC - PubMed
    1. Granja J et al. Nat. Biotechnol 37, 1458–1465 (2019): 10.1038/s41587-019-0332-7 - DOI - PMC - PubMed

Publication types