Genome Biol. 2020 Jul 27;21(1):183.
doi: 10.1186/s13059-020-02103-2.

Bayesian model selection reveals biological origins of zero inflation in single-cell transcriptomics


Kwangbom Choi et al. Genome Biol.

Abstract

Background: Single-cell RNA sequencing is a powerful tool for characterizing cellular heterogeneity in gene expression. However, high variability and a large number of zero counts present challenges for analysis and interpretation. There is substantial controversy over the origins and proper treatment of zeros and no consensus on whether zero-inflated count distributions are necessary or even useful. While some studies assume the existence of zero inflation due to technical artifacts and attempt to impute the missing information, other recent studies argue that there is no zero inflation in scRNA-seq data.

Results: We apply a Bayesian model selection approach to unambiguously demonstrate zero inflation in multiple biologically realistic scRNA-seq datasets. We show that the primary causes of zero inflation are not technical but rather biological in nature. We also demonstrate that parameter estimates from the zero-inflated negative binomial distribution are an unreliable indicator of zero inflation.

Conclusions: Despite the existence of zero inflation in scRNA-seq counts, we recommend the generalized linear model with negative binomial count distribution, not zero-inflated, as a suitable reference model for scRNA-seq analysis.
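
As an illustration of the two count distributions contrasted above, the following minimal sketch compares the probability of a zero count under a negative binomial (NB) model and a zero-inflated negative binomial (ZINB) model. The mean/dispersion parameterization, the function names, and the parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def nb_pmf(k, mu, r):
    """NB probability of count k, with mean mu > 0 and dispersion (size) r > 0."""
    log_p = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, r, pi0):
    """ZINB: with probability pi0 the count is a structural zero, otherwise NB."""
    structural = pi0 if k == 0 else 0.0
    return structural + (1.0 - pi0) * nb_pmf(k, mu, r)


# Illustrative values only (assumed, not estimates from the paper).
mu, r, pi0 = 1.5, 2.0, 0.3

p0_nb = nb_pmf(0, mu, r)
p0_zinb = zinb_pmf(0, mu, r, pi0)

print(f"P(0) under NB:   {p0_nb:.4f}")
print(f"P(0) under ZINB: {p0_zinb:.4f}")
```

With pi0 = 0 the ZINB model reduces to the NB model, which is why the two can be compared as nested candidates in a model selection procedure.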

Keywords: Bayesian model selection; Cell heterogeneity; Gene expression stochasticity; Single-cell RNA sequencing; Zero inflation.


Conflict of interest statement

None of the authors declares any competing interests.

Figures

Fig. 1
Factors that determine the number of zeros in scRNA-seq data. (a) Total UMI counts per cell, which range from 746 to 17,302 with an average of 3,819 UMIs per cell, are plotted against the number of zeros per cell. Color coding indicates the individual cell types as determined by data-driven clustering. The proportion of variance in the number of zeros that is explained by the total UMI count per cell ($R^2 = 0.947$) was computed by fitting a loess regression to the data (blue curve). (b) The per-gene rates of expression ($\mu_g$), which range from 0.23 to 97.4 with an average of 1.51 UMI/10K, are plotted against the number of zeros per gene. Genes that were identified as zero-inflated by scRATE (1 SE) are indicated in dark blue.
Fig. 2
Classification of genes by scRATE using the threshold of 1 SE. (a) A density histogram shows the model selection for genes by scRATE as a function of percent non-zero cells. The ZI genes are uniformly distributed across the range, including genes with few zero counts. (b) Density histogram of scRATE classification collapsed to show only the ZI versus NotZI genes across percentages of non-zero cells. (c) Distribution of percent of cells with non-zero UMI counts for genes according to scRATE classification. (d) Distribution of average expression levels of genes according to scRATE classification. Also see Additional file 1: Fig. S12 for the results with the other (0, 2, and 3 SE) thresholds.
Fig. 3
Zeros cluster within specific cell types. A bi-clustered heatmap of ZI genes (1 SE) by cell type shows that zeros occur more frequently in specific cell types. The color scale indicates the difference between the cell type-specific proportion of zeros and the mean proportion of zeros across all cells regardless of type. Light shading indicates the cell types that have the highest frequency of zero UMI counts. Dendrograms are shown in Additional file 1: Fig. S5.
Fig. 4
Effect of accounting for cell type on estimated zero inflation and overdispersion. (a) The scatterplot shows estimated zero inflation $\hat{\pi}_0$ before and after including cell type in the GLM with ZINB error model. Color coding indicates the ZI classification of genes (1 SE) before and after accounting for cell type. The red point (ZI:ZI) at 0.3 on the diagonal is Xist. The light blue point (NotZI:NotZI) to the right is Ddx3y. (b) The scatterplot shows the estimated overdispersion $\hat{r}$ before and after including cell type in the GLM with NB error model.
Fig. 5
Estimating zero inflation with a ZINB model. Zero inflation probability $\hat{\pi}_0$ estimated by ZINB on simulated NB data before cell type adjustment (a) and after cell type adjustment (b). Because simulated NB data contain no zero inflation, the ZINB model would be expected to produce estimates of $\hat{\pi}_0$ that are zero or very small. However, we find substantial overestimation of this quantity for many simulated genes. Scatterplots of true versus estimated zero inflation $\hat{\pi}_0$ by ZINB on simulated ZINB data before cell type adjustment (c) and after cell type adjustment (d). Once cell type heterogeneity is regressed out, zero inflation is reduced.
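
The captions for Figs. 3 to 5 describe zeros concentrating in particular cell types and apparent zero inflation shrinking once cell type is accounted for. A minimal sketch of that intuition, under assumed values, is given below: a hypothetical gene is silent in one cell type and NB-distributed in another, so the pooled zero fraction exceeds what a single NB with the pooled mean and the same dispersion would predict, while within each cell type there is no excess at all. The cell types, weights, means, and dispersion are invented for illustration and are not the authors' data or procedure.

```python
def nb_zero_prob(mu, r):
    """P(count = 0) for a negative binomial with mean mu and dispersion r."""
    return (r / (r + mu)) ** r


# Hypothetical gene: silent in cell type A, expressed in cell type B.
# Weights are the assumed fractions of cells belonging to each type.
cell_types = {
    "A": {"weight": 0.5, "mu": 0.0},
    "B": {"weight": 0.5, "mu": 4.0},
}
r = 5.0  # assumed dispersion, held fixed for simplicity

# Zero fraction actually produced by the two-type mixture.
observed_zero_frac = sum(
    c["weight"] * nb_zero_prob(c["mu"], r) for c in cell_types.values()
)

# Zero fraction predicted by one NB that ignores cell type.
pooled_mu = sum(c["weight"] * c["mu"] for c in cell_types.values())
pooled_nb_zero = nb_zero_prob(pooled_mu, r)

excess_ignoring_type = observed_zero_frac - pooled_nb_zero

print(f"Zero fraction in mixture:        {observed_zero_frac:.4f}")
print(f"Zero fraction from pooled NB:    {pooled_nb_zero:.4f}")
print(f"Excess zeros when type ignored:  {excess_ignoring_type:.4f}")
```

The sketch holds the dispersion fixed; a pooled NB fit with a freely estimated dispersion could absorb part of the excess, which is consistent with Fig. 4b reporting overdispersion estimates both before and after including cell type.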
