Nat Methods. 2019 Oct;16(10):1007-1015.
doi: 10.1038/s41592-019-0529-1. Epub 2019 Sep 9.

Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling

Allen W Zhang et al. Nat Methods. 2019 Oct.

Abstract

Single-cell RNA sequencing has enabled the decomposition of complex tissues into functionally distinct cell types. Often, investigators wish to assign cells to cell types through unsupervised clustering followed by manual annotation or via 'mapping' to existing data. However, manual interpretation scales poorly to large datasets, mapping approaches require purified or pre-annotated data and both are prone to batch effects. To overcome these issues, we present CellAssign, a probabilistic model that leverages prior knowledge of cell-type marker genes to annotate single-cell RNA sequencing data into predefined or de novo cell types. CellAssign automates the process of assigning cells in a highly scalable manner across large datasets while controlling for batch and sample effects. We demonstrate the advantages of CellAssign through extensive simulations and analysis of tumor microenvironment composition in high-grade serous ovarian cancer and follicular lymphoma.

Conflict of interest statement

Competing Interests

S.P.S. and S.A. are founders, shareholders, and consultants of Contextual Genomics Inc.

Figures

Figure 1.
(a) Overview of CellAssign. CellAssign takes raw count data from a heterogeneous single-cell RNA-seq population, along with a set of known marker genes for various cell types under study. Using CellAssign for inference, each cell is probabilistically assigned to a given cell type without any need for manual annotation or intervention, accounting for any batch or sample-specific effects. (b) An overview of the CellAssign probabilistic graphical model. The random variables and data that form the model, along with the distributional assumptions are shown.
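The idea in panel (a), raw counts plus a marker gene list going in and per-cell type probabilities coming out, can be illustrated with a toy sketch. This is not the CellAssign model: the Poisson likelihood, the fixed `base_rate` and `fold_change` parameters, the uniform prior over cell types, and the function name are all assumptions made here for brevity, and batch or sample effects are ignored.

```python
import math

def assign_cell_types(counts, marker_matrix, base_rate=1.0, fold_change=5.0):
    """Toy marker-based assignment; returns one posterior per cell.

    counts:        list of cells, each a list of raw counts per gene
    marker_matrix: marker_matrix[g][c] is 1 if gene g is a marker of type c
    base_rate, fold_change: hypothetical fixed values (a real model would
                            infer expression parameters from the data)
    """
    n_types = len(marker_matrix[0])
    posteriors = []
    for cell in counts:
        log_liks = []
        for c in range(n_types):
            ll = 0.0
            for g, y in enumerate(cell):
                # Marker genes are assumed to be over-expressed by fold_change.
                mu = base_rate * (fold_change if marker_matrix[g][c] else 1.0)
                # Poisson log-probability of observing count y with mean mu.
                ll += y * math.log(mu) - mu - math.lgamma(y + 1)
            log_liks.append(ll)
        # Normalize with a uniform prior (numerically stable softmax).
        top = max(log_liks)
        weights = [math.exp(ll - top) for ll in log_liks]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors

# Two genes, two cell types; gene 0 marks type 0 and gene 1 marks type 1.
markers = [[1, 0], [0, 1]]
cells = [[10, 0], [0, 9]]
probabilities = assign_cell_types(cells, markers)
```

Each row of `probabilities` sums to one, and taking the largest entry per row gives a hard assignment of the kind the later captions call a maximum probability assignment.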
Figure 2.
Performance of CellAssign on simulated data. (a) Accuracy and cell-level F1 score (Methods) for varying proportions of differentially expressed genes per cell type, with other differential expression parameters set to MAP estimates determined from comparing naïve CD8+ and naïve CD4+ T cells (Methods). CellAssign was provided with a set of marker genes (Methods); all other methods were provided with all genes. *, **, *** denote FDR-adjusted p-values (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, and 0.001, respectively. Dotted lines separate marker-based, unsupervised, and supervised methods. (b) Accuracy and cell-level F1 score for CellAssign, SCINA (default parameters) and SCINA (sensitivity cutoff of 0.1) for simulated data from 6 cell types, where zero to 4 cell types were removed from the data (but kept in the marker gene list). (c) Accuracy and cell-level F1 score for CellAssign, SCINA (default sensitivity cutoff) and SCINA (sensitivity cutoff of 0.1) for simulated data from 6 cell types, where zero to 4 cell types were removed from the marker gene list. Marker genes were inferred without knowledge of the removed cell types. (d) Cell type labels for human liver data from [21]. (e) CellAssign MAP assignments for human liver data, where marker genes for only hepatocytes, cholangiocytes, and mature B cells from [21] were specified. (f) CellAssign probabilities for cell line mixture data from [14], where known proportions of 3 lung adenocarcinoma cell lines (H1975, H2228, HCC827) were mixed in 9-cell combinations. 30 bulk RNA-derived marker genes for each cell line were used (Supplementary Notes 2.7). Lower and upper hinges denote the 1st and 3rd quartiles on boxplots, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
Figure 3.
CellAssign infers the composition of the HGSC microenvironment. (a) UMAP plot of HGSC single cell expression data, labeled by sample. (b) UMAP plot of HGSC single cell expression data, labeled by maximum probability assignments from CellAssign. (c) Proportions of CellAssign cell types in each sample, with total cell counts indicated. (d) Expression (log normalized counts) of EPCAM (for epithelial cells), CD45 (PTPRC) (for hematopoietic cells), MUM1L1 (for ovary-derived cells), and COL1A1 (for collagen-producing fibroblasts and smooth muscle cells). Expression values were winsorized between 0 and 4. (e) Hallmark pathway enrichment results for left ovary vs. right ovary epithelial cells (Methods). (f) Unsupervised clustering of epithelial cells (Methods). (g) Expression (log normalized counts) of epithelial-mesenchymal transition (EMT) associated markers, N-cadherin (CDH2) and CD90 (THY1) in epithelial cells. (h) Expression (log normalized counts) of select HLA class I genes in epithelial cells.
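Winsorizing, as used for the expression values in panel (d), clamps every value into a fixed range so that a few extreme cells do not dominate the color scale. A minimal sketch, assuming the values are already log normalized; the function name and the sample values are illustrative only and do not come from the paper:

```python
def winsorize(values, lower, upper):
    """Clamp each value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical log-normalized expression values for one gene.
expression = [-1.0, 2.5, 7.2]
clamped = winsorize(expression, 0, 4)  # values below 0 become 0, above 4 become 4
```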
Figure 4.
CellAssign infers the composition of the follicular lymphoma microenvironment. (a) Sample collection times for FL1018 (transformed FL) and FL2001 (progressed FL). FL1018 is alive while FL2001 was lost to follow-up (indicated by the red rectangle). The number of cells collected for each sample is indicated. (b) UMAP plot of follicular lymphoma single cell expression data, labeled by sample. (c) UMAP plot of follicular lymphoma single cell expression data, labeled by maximum probability assignments from CellAssign. (d) Expression (log normalized counts) of select marker genes CD79A (for B cells), CD3D (for T cells), CCL5 (for CD8+ T cells), and ICOS (for T follicular helper cells). Expression values were winsorized between 0 and 3.
Figure 5.
Temporal changes in nonmalignant cells in the follicular lymphoma microenvironment. (a) Left: UMAP plot of CellAssign-inferred B cells, labeled by sample. Right: UMAP plot of CellAssign-inferred B cells, labeled by putative malignant/nonmalignant status. (b) Expression (log normalized counts) of κ (IGKC) and λ (IGLC2 and IGLC3) light chain constant region genes. Expression values were winsorized between 0 and 6. (c) Scvis plot of follicular lymphoma data and single cell RNA-seq data of lymphocytes from reactive lymph nodes from healthy patients. The follicular lymphoma data was used to train the variational autoencoder and produce the two-dimensional embedding. Indicated cell types are B cell (nonmalignant B cell from FL), B cell (malignant) (malignant B cell from FL), T cell (T cell from FL), RLN (reactive lymph node cell). (d) Relative proportion of B cell subpopulations over time, with total B cell counts indicated. (e) UMAP plots of FL T cells, labeled by sample and CellAssign-inferred cell type. (f) Relative proportion of T cell subpopulations over time, with total T cell counts indicated. (g) Normalized expression of CD8+ T cell activation markers over time. P-values computed with the two-sided Wilcoxon rank-sum test and adjusted with the Benjamini-Hochberg method. n = 95, 96, 90, and 23 single cells identified as CD8+ T cells in FL1018T1, FL1018T2, FL2001T1, and FL2001T2, respectively. Lower and upper hinges denote the 1st and 3rd quartiles on boxplots, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
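The Benjamini-Hochberg adjustment mentioned in panel (g) can be sketched in a few lines. This is a generic textbook implementation, not code from the paper; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four marker comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

An adjusted value below a chosen threshold such as 0.05 is then read as significant at that false discovery rate.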

References

    1. Consortium TM et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature (2018).
    2. Kiselev VY et al. SC3: consensus clustering of single-cell RNA-seq data. Nature Methods 14, 483 (2017).
    3. Butler A, Hoffman P, Smibert P, Papalexi E & Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology. ISSN: 1087–0156. doi: 10.1038/nbt.4096. https://www.nature.com/articles/nbt.4096 (2018).
    4. Zurauskiene J & Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).
    5. Levine JH et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
