Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Mar;12(1):609-632.
doi: 10.1214/17-AOAS1110. Epub 2018 Mar 9.

A UNIFIED STATISTICAL FRAMEWORK FOR SINGLE CELL AND BULK RNA SEQUENCING DATA

Affiliations

A UNIFIED STATISTICAL FRAMEWORK FOR SINGLE CELL AND BULK RNA SEQUENCING DATA

Lingxue Zhu et al. Ann Appl Stat. 2018 Mar.

Abstract

Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the "dropout" events. A "dropout" happens when the RNA for a gene fails to be amplified prior to sequencing, producing a "false" zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows the strength from both data sources and carefully models the dropouts in single cell data, leading to a more accurate estimation of cell type specific gene expression profile. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes' approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data, as well as in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.

Keywords: EM algorithm; Gibbs sampling; Single cell RNA sequencing; empirical Bayes; hierarchical model.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
Plate model of URSM, with both single cell data (on the left) and bulk samples (on the right). The two greyed nodes X and Y represent observed gene expression levels. Node S is a binary variable representing dropout status in single cells, and node W represents the mixing proportions in bulk samples. The node π representing observation probability is double-circled because it is deterministic, and all model parameters are shown without circles, including the profile matrix A that links the two data sources.
Fig. 2
Fig. 2
(a) Simulated logistic dropout probability curves for 100 single cells, as defined in equation (4.1). (b)–(d) True profile matrix A versus the estimated Â, plotted in the log scale, using (i) the naive sample mean estimation [equation (4.2)], (ii) a submodel using only single cell data and (iii) URSM with both single cell and bulk data. The L1 loss Σi, kik – Aik| is reported on the top.
Fig. 3
Fig. 3
The L1 loss of recovering (a) the profile matrix, Σi, kik – Aik|, and (b) mixing proportions, Σk, jkj – Wkj |. We evaluate DSA and ssNMF when the marker genes are extracted from single cell data using different thresholds of p-values, as well as under the oracle condition where the true marker genes are given. We evaluate Cibersort on estimating W when the input signature matrix is based on the estimated  from URSM. We report its performance when the entire  is used (“Cibersort all”), as well as when only the estimated marker genes are used (“Cibersort”). The performance of URSM is plotted with a square in both panels, which does not depend on thresholding p-values.
Fig. 4
Fig. 4
(a) ROC curves of identifying dropout entries in single cell data. (b) True profile matrix A versus the sample average of a rank-3 NMF approximation, plotted in the log scale. The L1 loss Σi, kik – Aik| is reported on the top.
Fig. 5
Fig. 5
The average per cell type L1 loss of recovering the profile matrix A and the mixing proportions W in 10 repetitions, with the standard deviations shown by the error bars, when (a) Ksc, Kbk ∈ {3, 4, 5} with N = 200 genes; (b) N = {200, 500, 1000} with Ksc = Kbk = 3. Each figure shows the performance of (i) URSM, (ii) DSA and ssNMF with 5 true marker genes and 3 imperfect marker genes per cell type as input and (iii) DSA and ssNMF under the oracle scenario where 5 true marker genes per cell type are provided. We also report the performance of Cibersort for estimating W using the estimated Âunif from URSM as the input signature matrix.
Fig. 6
Fig. 6
(a) Single cell gene expressions (log2(FPKM + 1)) after removing 7 ambiguously labeled cells. Rows are 213 cells and columns are 273 genes. (b) PCA applied on the original single cell data with 220 labeled cells using 273 PC genes, where the Monocle algorithm is applied to construct pseudo developmental times. 7 cells are identified to be ambiguously labeled and are removed from our analyses (marked as triangles). (c) Entries in cleaned single cell data that are inferred to be dropout and imputed (marked in blue) versus the entries that are inferred to be structural zeros (marked in white) in cleaned single cell data. The entries with positive expression levels have no need for posterior inference, and are marked in grey. (d) After imputing dropout genes, PCA is conducted on the 213 cells using 273 PC genes, and the three different types of cells are more clearly separated.
Fig. 7
Fig. 7
Deconvolution of bulk samples into three cell types, using (a) URSM, (b) Cibersort, (c) Digital Sorting Algorithm (DSA) and (d) semi-supervised Nonnegative Matrix Factorization (ssNMF).

References

    1. Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS ONE. 2009;4:e6098. - PMC - PubMed
    1. Blei DM, Kucukelbir A, McAuliffe JD. Variational inference: A review for statisticians. J Amer Statist Assoc. 2017;112:859–877.
    1. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.
    1. Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods. 2013;10:1093–1095. - PubMed
    1. Camp JG, Badsha F, Florio M, Kanton S, Gerber T, Wilsch-Bräuninger M, Lewitus E, Sykes A, Hevers W, Lancaster M, et al. Human cerebral organoids recapitulate gene expression programs of fetal neocortex development. Proc Natl Acad Sci USA. 2015;112:15672–15677. - PMC - PubMed