Genome Biol. 2016 Apr 27:17:75.
doi: 10.1186/s13059-016-0947-7.

Pooling across cells to normalize single-cell RNA sequencing data with many zero counts

Aaron T L Lun et al. Genome Biol.

Abstract

Normalization of single-cell RNA sequencing data is necessary to eliminate cell-specific biases prior to downstream analyses. However, this is not straightforward for noisy single-cell data where many counts are zero. We present a novel approach where expression values are summed across pools of cells, and the summed values are used for normalization. Pool-based size factors are then deconvolved to yield cell-based factors. Our deconvolution approach outperforms existing methods for accurate normalization of cell-specific biases in simulated data. Similar behavior is observed in real data, where deconvolution improves the relevance of results of downstream analyses.

Keywords: Differential expression; Normalization; Single-cell RNA-seq.

Figures

Fig. 1
Performance of existing normalization methods on the simulated data with DE genes and stochastic zeroes. The size factor estimates for all cells are plotted against the true values for (a) DESeq, (b) TMM, and (c) library size normalization. Simulations were performed with no DE (first row), moderate DE (second row), strong DE (third row), and varying magnitudes of DE (fourth row). Axes are shown on a log-scale. For comparison, each set of size factors was scaled such that the grand mean across cells was the same as that for the true values. The red line represents equality between the rescaled estimates and true factors. Cells in the first, second, and third subpopulations are shown in black, blue, and orange, respectively. DE, differentially expressed; TMM, trimmed mean of M values.
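The rescaling step mentioned in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the count matrix is assumed to be genes by cells, and "grand mean" is assumed to be the arithmetic mean across cells.

```python
import numpy as np

def library_size_factors(counts):
    """Library size normalization: each cell's total count, scaled to a mean of 1."""
    lib_sizes = np.asarray(counts, dtype=float).sum(axis=0)  # one total per cell (column)
    return lib_sizes / lib_sizes.mean()

def rescale_to_match_mean(estimates, true_factors):
    """Scale the estimates so that their mean across cells equals that of the true factors."""
    estimates = np.asarray(estimates, dtype=float)
    return estimates * (np.mean(true_factors) / estimates.mean())

# Toy data: 2 genes x 3 cells (values are made up for illustration).
toy_counts = np.array([[1, 2, 3],
                       [3, 6, 9]])
toy_true = np.array([1.0, 2.0, 3.0])
rescaled = rescale_to_match_mean(library_size_factors(toy_counts), toy_true)
print(rescaled)
```

After rescaling, estimated and true factors share the same overall scale, so any deviation from the line of equality reflects cell-specific error and not an arbitrary scaling constant.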
Fig. 2
Illustration of the effect of removing stochastic zeroes (black) from the distribution of ratios across all genes. Distributions are shown for cells with (a) small and (b) large θ_j. The estimated median ratio (dashed) is increased beyond the true median (full) upon removal of zeroes, which results in overestimation of the size factor for the cell. This effect is more pronounced for cells with small θ_j that have greater numbers of zeroes, compared to cells with large θ_j where the estimated and true medians are more similar.
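The effect described in the Fig. 2 caption can be reproduced with a small simulation. The sketch below is an assumption-laden illustration and not the paper's simulation design: counts are drawn as Poisson with mean θ_j·λ_i, the per-gene abundances λ_i are drawn uniformly from [1, 5), and the ratios are taken against the known λ_i instead of an estimated reference.

```python
import numpy as np

def median_ratios(theta, lam, rng):
    """Return the median count/abundance ratio with and without zero counts.

    theta : true size factor of one cell (hypothetical value)
    lam   : per-gene abundances (assumed known here for simplicity)
    """
    counts = rng.poisson(theta * lam)
    ratios = counts / lam
    median_all = np.median(ratios)                  # zeroes kept
    median_nonzero = np.median(ratios[counts > 0])  # zeroes removed
    return median_all, median_nonzero

rng = np.random.default_rng(0)
lam = rng.uniform(1.0, 5.0, size=2000)
for theta in (0.1, 5.0):
    with_zeroes, without_zeroes = median_ratios(theta, lam, rng)
    print(f"theta={theta}: median with zeroes={with_zeroes:.3f}, "
          f"without zeroes={without_zeroes:.3f}, "
          f"relative to theta={without_zeroes / theta:.2f}")
```

Under these assumptions, a nonzero count is at least 1, so every surviving ratio for the small-θ_j cell is at least 1/λ_i, which is well above θ_j. The cell with large θ_j has few zeroes, so discarding them barely moves its median.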
Fig. 3
Schematic of the deconvolution method. All cells in the data set are averaged to make a reference pseudo-cell. Expression values for cells in pool A are summed together and normalized against the reference to yield a pool-based size factor θ_A. This is equal to the sum of the cell-based factors θ_j for cells j=1–4 and can be used to formulate a linear equation. (For simplicity, the t_j term is assumed to be unity here.) Repeating this for multiple pools (e.g., pool B) leads to the construction of a linear system that can be solved to estimate θ_j for each cell j.
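The scheme in the Fig. 3 caption can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name, the sliding-window pools over an arbitrary cell ordering, the pool sizes, the median-ratio estimate of each pool factor, and the plain least-squares solve are all choices made here for brevity. As in the caption, the t_j term is taken to be unity.

```python
import numpy as np

def deconvolve_size_factors(counts, pool_sizes=(7, 11, 13)):
    """Estimate per-cell size factors from pooled counts (illustrative sketch).

    counts     : genes x cells array of expression values
    pool_sizes : hypothetical pool sizes; each must be smaller than the number of cells
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[1]
    if max(pool_sizes) >= n_cells:
        raise ValueError("every pool size must be smaller than the number of cells")

    # Reference pseudo-cell: the average of all cells, gene by gene.
    reference = counts.mean(axis=1)
    keep = reference > 0  # genes with a zero reference give undefined ratios
    counts, reference = counts[keep], reference[keep]

    rows, pool_factors = [], []
    for size in pool_sizes:
        for start in range(n_cells):
            # Pool = a window of consecutive cells, wrapping around the end.
            members = (start + np.arange(size)) % n_cells
            pooled = counts[:, members].sum(axis=1)
            # Pool-based size factor: median ratio of pooled counts to the reference.
            pool_factors.append(np.median(pooled / reference))
            row = np.zeros(n_cells)
            row[members] = 1.0  # one equation: sum of member factors = pool factor
            rows.append(row)

    design = np.vstack(rows)
    theta, *_ = np.linalg.lstsq(design, np.array(pool_factors), rcond=None)
    return theta

# Usage on simulated Poisson counts (all values are made up for illustration).
rng = np.random.default_rng(42)
true_theta = rng.uniform(0.5, 2.0, size=60)
gene_means = rng.uniform(0.1, 2.0, size=500)
sim_counts = rng.poisson(np.outer(gene_means, true_theta))
estimated = deconvolve_size_factors(sim_counts)
print(np.corrcoef(estimated, true_theta)[0, 1])
```

Summing over a pool reduces the number of zeroes before the median ratio is taken, which is the motivation for pooling. The estimates are expressed relative to the average pseudo-cell, so they match the true factors only up to a common scaling.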
Fig. 4
Size factor estimates from the deconvolution method in the simulation with DE genes and stochastic zeroes. These are shown against the true values for scenarios with (a) no DE, (b) moderate DE, (c) strong DE, and (d) varying magnitude of DE. Cells in the first, second, and third subpopulations are shown in black, blue, and orange, respectively. Axes are shown on a log-scale, and the red line represents equality with the true factors. DE, differentially expressed.
Fig. 5
Comparisons between the estimated size factors. Those from the deconvolution method are compared to those from (a) DESeq, (b) TMM, and (c) library size normalization. This is shown for the brain (top) and inDrop data sets (bottom). Axes are on a log-scale, and the red line represents equality between the two sets of factors. All sets of factors were centered to a median of unity prior to comparison. For the brain data, cells classified by Zeisel et al. as oligodendrocytes or pyramidal CA1 cells are shown here in orange and blue, respectively. TMM, trimmed mean of M values.
