Mol Syst Biol. 2021 Jan;17(1):e9620.

doi: 10.15252/msb.20209620.

Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models

Chenling Xu¹, Romain Lopez², Edouard Mehlman^{2,3}, Jeffrey Regier⁴, Michael I Jordan^{2,5}, Nir Yosef^{1,2,6,7}

Affiliations

¹ Center for Computational Biology, University of California, Berkeley, CA, USA.
² Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA.
³ Centre de Mathématiques Appliquées École polytechnique, Palaiseau, France.
⁴ Department of Statistics, University of Michigan, Ann Arbor, MI, USA.
⁵ Department of Statistics, University of California, Berkeley, CA, USA.
⁶ Ragon Institute of MGH, MIT and Harvard, Boston, MA, USA.
⁷ Chan-Zuckerberg Biohub Investigator, San Francisco, CA, USA.

PMID: 33491336
PMCID: PMC7829634
DOI: 10.15252/msb.20209620

Chenling Xu et al. Mol Syst Biol. 2021 Jan.


Abstract

As the number of single-cell transcriptomics datasets grows, the natural next step is to integrate the accumulating data to achieve a common ontology of cell types and states. However, it is not straightforward to compare gene expression levels across datasets and to automatically assign cell type labels in a new dataset based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of scRNA-seq data, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage existing cell state annotations. We demonstrate that scVI and scANVI compare favorably to state-of-the-art methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings. In contrast to existing methods, scVI and scANVI integrate multiple datasets with a single generative model that can be directly used for downstream tasks, such as differential expression. Both methods are easily accessible through scvi-tools.

Keywords: annotation; differential expression; harmonization; scRNA-seq; variational inference.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1. Harmonization and annotation of scRNA‐seq datasets with generative models**
Functional overview of the methods proposed in this manuscript.
Schematic diagram of the variational inference procedure in both the scVI and scANVI models. We show the order in which random variables in the generative model are sampled and how these variables can be used to derive biological insights.
The graphical models of scVI and scANVI. Vertices with black edges represent variables in both scVI and scANVI, and vertices with red edges are unique to scANVI. Shaded vertices represent observed random variables. Semi‐shaded vertices represent variables that can be either observed or random. Empty vertices represent latent random variables. Edges signify conditional dependency. Rectangles (“plates”) represent independent replication. The complete model specification and definition of internal variables are provided in the Materials and Methods.

**Figure 2. Benchmarking of scRNA‐seq harmonization algorithms**
Each row is a different dataset. Each column is a metric.
k‐nearest neighbors purity, which ranges from 0 to 1; higher values mean better preservation of the neighbor structure of the individual datasets after harmonization.
Entropy of batch mixing, where higher values mean that cells from different datasets are well mixed.
The trade‐off between the kNN purity and entropy of batch mixing for a fixed K = 150. Methods in the top‐right corner perform better.
The trade‐off between entropy of batch mixing and the preservation of biological information, using an alternative unsupervised statistic, k‐means clustering preservation.
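The k‐nearest neighbors purity used in these panels can be illustrated with a minimal Python sketch. It assumes one common formulation: for each cell, take the Jaccard index between its k nearest neighbors in the latent space of its own dataset and its k nearest neighbors after harmonization, then average over cells. The function names, the Euclidean distance, and the brute‐force neighbor search are illustrative assumptions, not the authors' implementation; the exact definition is given in the paper's Materials and Methods.

```python
import math


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors.

    Brute-force Euclidean search; ties are broken by index order.
    """
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def knn_purity(before, after, k):
    """Mean Jaccard overlap of each cell's kNN set before vs. after harmonization.

    `before` and `after` hold coordinates for the same cells in the same order
    (hypothetical inputs: a per-dataset embedding and the harmonized embedding).
    """
    nn_before = knn_indices(before, k)
    nn_after = knn_indices(after, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(nn_before, nn_after)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    original = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
    # Same cells, but the embedding swaps cells 1 and 2, breaking neighborhoods.
    scrambled = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (6.0, 0.0)]
    print(knn_purity(original, original, k=1))
    print(knn_purity(original, scrambled, k=1))
```

A score of 1 means every cell keeps exactly the same neighbors after harmonization, and a score of 0 means no neighbor is preserved, which matches the 0 to 1 range described in the caption.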

**Figure 3. Harmonizing datasets with different cellular composition**
A–D
The case when no cell type is shared. PBMC‐8K contains all cells other than cell type c₀ while PBMC‐CITE contains only cell type c₀. Six experiments were run, each keeping one cell type from the PBMC‐CITE dataset. (A, B) UMAP visualization for the case where c₀ corresponds to natural killer cells. (C, D) Entropy of batch mixing and k‐nearest neighbors purity, aggregating the six experiments (setting c₀ to a different cell type in each experiment). Data information: Red arrows indicate the desired direction for each performance measure. Low batch entropy is desirable in (C) while high k‐nearest neighbors purity is desirable in (D).
E–I
The case when cell type c₀ is removed from PBMC‐8K but not from PBMC‐CITE. Six experiments were run, each removing one cell type from the PBMC‐CITE dataset. (E, F) UMAP visualization for the case where c₀ corresponds to CD4⁺ T cells. (G) Entropy of batch mixing for the removed cell type. A lower value is more desirable, as indicated by the red arrow. (H) Entropy of batch mixing for the remaining cell types. A higher value is more desirable, as indicated by the red arrow. (I) k‐nearest neighbors purity. A higher value is more desirable, as indicated by the red arrow.
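The entropy of batch mixing reported in these panels can be sketched as follows. This is a simplified version under stated assumptions: for every cell, the batch labels of its k nearest neighbors are collected and their Shannon entropy (natural logarithm) is computed, and the per‐cell entropies are averaged. The paper's exact procedure, including any subsampling of cells and the choice of k, is described in its Materials and Methods and is not reproduced here; all function names are hypothetical.

```python
import math
from collections import Counter


def batch_entropy(labels):
    """Shannon entropy (in nats) of the batch labels in one neighborhood."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def entropy_of_batch_mixing(points, batches, k):
    """Average neighborhood batch entropy over all cells.

    `points` are latent coordinates, `batches` the dataset of origin per cell.
    Neighbors are found by brute-force Euclidean search.
    """
    total = 0.0
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        total += batch_entropy([batches[j] for _, j in nearest])
    return total / len(points)


if __name__ == "__main__":
    # Two tight clusters of three cells each.
    points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
    separated = ["A", "A", "A", "B", "B", "B"]  # each cluster is one batch
    mixed = ["A", "B", "A", "B", "A", "B"]      # batches interleaved
    print(entropy_of_batch_mixing(points, separated, k=2))
    print(entropy_of_batch_mixing(points, mixed, k=2))
```

With two batches the per‐neighborhood entropy is bounded by ln 2, so a cell type present in only one dataset should score near zero (the desirable outcome in panels C and G), while shared cell types should score higher (panel H).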

**Figure 4. Harmonizing developmental trajectories**
A, B
UMAP visualization of the scVI latent space, with cells colored by the original labels from either the HEMATO‐Paul (A) or HEMATO‐Tusi (B) studies. The cells from the other dataset are colored in gray.
C
Entropy of batch mixing along 20 bins of the HEMATO‐Tusi cells, ordered by the potential of each cell. Potential is a pseudotime measure that describes the differentiation state of a cell using the population balance analysis algorithm (center: common myeloid progenitors; moving left: erythrocyte branch; moving right: granulocyte branch).
D
k‐nearest neighbors purity for scVI, Seurat, and scANVI.
E
Expression of marker genes that help determine the identity of batch‐unique cells.

**Figure 5. Validation of cell type annotations using additional metadata**
A, B
UMAP plot of the scANVI latent space inferred for three harmonized datasets: PBMC‐CITE, PBMC‐sorted, and PBMC‐68K. Cells are colored by the dataset of origin (A) and the PBMC‐sorted labels (B). Cells from the PBMC‐CITE and PBMC‐68K are colored in gray in (B).
C
The consistency of the harmonized PBMC‐CITE mRNA data with the respective protein measurements, evaluated by mean squared error for different neighborhood sizes. Lower values indicate higher consistency.
D
UMAP plot of the scANVI latent space, where cells are colored by normalized protein measurement. Only PBMC‐CITE cells are displayed.
E
UMAP plot of the scANVI latent space, with cells from the PBMC‐68K dataset colored according to their original label. For clarity of presentation, only cells originally labeled as dendritic cells or natural killer cells are colored. Evidently, a large number of these cells are mapped to a cluster of T cells (right side of the plot).

**Figure 6. Cell type annotation in a single dataset using “seed” labeling**
A
Discrepancies between marker genes that can be used to confidently label cells and highly variable genes in scRNA‐seq analysis.
B–D
UMAP plot of the scVI latent space. (B) Seed cells are colored by their annotation (using known marker genes). (C) PBMC‐sorted cell type labels from the original study based on marker‐based sorting. (D) The posterior probability of each cell being one of the four T cell subtypes obtained with scANVI.

**Figure 7. Differential Expression on multiple datasets with scVI**
Evaluation of consistency with Spearman rank correlation and Kendall‐Tau is shown for comparisons of multiple pairs of cell types in the simulated data. For each comparison, we subsampled 30 cells from each group, and repeated the subsampling 10 times to evaluate the uncertainty in our result.
Distribution of true log fold change between all pairs of cell types for the simulated data. The pairs of cells are chosen to represent different levels of distance on the tree as in Appendix Fig S18A. The pairs of populations, from most distant to least distant, are “12”, “24”, “23”, “45”.
Evaluation of consistency with the AUROC and Kendall‐Tau metrics is shown for comparisons of CD4 vs CD8 T cells and B cells vs dendritic cells on the PBMC‐8K only (A), the PBMC‐68K only (B) and the merged PBMC‐8K / PBMC‐68K (A + B) for scVI and edgeR. For each comparison, we subsampled 30 cells from each group, and repeated the subsampling 10 times to evaluate the uncertainty in our result.
Mislabeling experiment in differential expression in both the SymSim simulated datasets and the PBMC‐8K and PBMC‐68K datasets. The top row shows differential expression results for the correctly labeled population pair (Population 1 vs. Population 2 in the simulated dataset and CD4 T cells vs. CD8 T cells in the PBMC dataset). The bottom row shows differential expression results for the mislabeled population pair (Population 2 vs. Population 3 in the simulated dataset and dendritic cells vs. B cells in the PBMC dataset). In all panels, the x‐axis represents the proportion of flipped labels.
Data information: The boxplots are standard Tukey boxplots, where the box is delineated by the first and third quartiles and the whisker lines lie at the first quartile minus, and the third quartile plus, 1.5 times the box height. The dots are outliers that fall above or below the whisker lines. The center band indicates the median.
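The boxplot convention described above can be made concrete with a short Python sketch. It follows the caption literally, placing the whisker lines at the first and third quartiles minus and plus 1.5 times the box height. The function name and the choice of the "inclusive" quantile method are assumptions; plotting libraries differ in how they interpolate quartiles, and many draw the whiskers only out to the most extreme data point inside these limits.

```python
import statistics


def tukey_box_stats(values):
    """Quartiles, whisker limits, and outliers for a Tukey-style boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_height = q3 - q1  # interquartile range
    lower_whisker = q1 - 1.5 * box_height
    upper_whisker = q3 + 1.5 * box_height
    # Points beyond the whisker limits are drawn as individual dots.
    outliers = [v for v in values if v < lower_whisker or v > upper_whisker]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Hypothetical scores from ten repeated subsamplings, one clearly extreme.
    scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    for name, value in tukey_box_stats(scores).items():
        print(name, value)
```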
