Genome Biol. 2018 Jun 19;19(1):78.
doi: 10.1186/s13059-018-1449-6.

dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments

Viktor Petukhov et al. Genome Biol. 2018.

Abstract

Recent single-cell RNA-seq protocols based on droplet microfluidics use massively multiplexed barcoding to enable simultaneous measurements of transcriptomes for thousands of individual cells. The increasing complexity of such data creates challenges for subsequent computational processing and troubleshooting of these experiments, with few software options currently available. Here, we describe a flexible pipeline for processing droplet-based transcriptome data that implements barcode corrections, classification of cell quality, and diagnostic information about the droplet libraries. We introduce advanced methods for correcting composition bias and sequencing errors affecting cellular and molecular barcodes to provide more accurate estimates of molecular counts in individual cells.

Conflict of interest statement

Ethics approval

The animal work conducted for this study was approved by the Institutional Animal Care and Use Committee (IACUC) of Massachusetts General Hospital.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Skewed distribution of UMIs leads to increased number of UMI collisions. a Distribution of UMI occurrence frequencies across all genes is shown for mouse embryonic stem (ES) cells (dataset 1). The top-right inset shows position-specific nucleotide frequencies of the outlier UMIs (highlighted by gray shading on the main plot). Significant skewness of the UMI distribution decreases the effective pool of UMIs. b Proportions of different nucleotides in the UMI sequences are shown as a function of the overall UMI frequencies (x-axis orders UMIs so that most frequently occurring UMI sequences have low rank) for the mouse ES cells (dataset 1). c Estimated number of UMI collisions as a function of the true gene expression level (x-axis) is shown for different UMI lengths (simulated by trimming 10-nucleotide UMIs; see text). The estimates based on the uniform and empirical UMI distributions are shown. The 10x Chromium human post-transplant BMMC dataset (dataset 7) was used. For short UMIs, the number of collisions observed at highly expressed genes can be comparable to the true number of molecules. Longer UMIs decrease the number of collisions.
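The dependence of collisions on UMI length and on the skew of the UMI distribution can be illustrated with a minimal sketch. It assumes that each molecule of a gene draws its UMI independently from a fixed distribution, so the expected number of distinct UMIs among n molecules is the sum over UMIs of 1 - (1 - p_i)^n. This is a textbook occupancy calculation and not necessarily the estimator used by dropEst. The function names are hypothetical, and the Dirichlet draw is only a synthetic stand-in for an empirical UMI distribution.

```python
import numpy as np


def expected_distinct_umis(n_molecules, umi_probs):
    """Expected number of distinct UMIs when n molecules draw UMIs independently."""
    p = np.asarray(umi_probs, dtype=float)
    # A UMI with probability p_i is seen at least once with probability 1 - (1 - p_i)^n.
    return float(np.sum(1.0 - (1.0 - p) ** n_molecules))


def expected_collisions(n_molecules, umi_probs):
    """Expected number of molecules hidden because they share a UMI with another one."""
    return n_molecules - expected_distinct_umis(n_molecules, umi_probs)


def uniform_probs(umi_length):
    """Uniform distribution over all 4^length UMI sequences."""
    n_umis = 4 ** umi_length
    return np.full(n_umis, 1.0 / n_umis)


def skewed_probs(umi_length, seed=0):
    """Synthetic skewed distribution (assumption: stands in for empirical frequencies)."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(4 ** umi_length))


if __name__ == "__main__":
    for length in (6, 8):
        for name, probs in (("uniform", uniform_probs(length)),
                            ("skewed", skewed_probs(length))):
            print(length, name, round(expected_collisions(1000, probs), 1))
```

Under this model, a skewed distribution always yields at least as many expected collisions as the uniform one for the same UMI length, which is consistent with the caption's point that skewness shrinks the effective UMI pool.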
Fig. 2
Comparison of UMI collision and sequencing error correction methods. Comparison of UMI collision adjustment and UMI correction algorithms is shown using the 10x post-transplant BMMC dataset (dataset 7). a The scatter plot shows percentage error (y-axis) in estimation of the molecular counts for different genes using computationally trimmed UMIs (down to 6–9-nucleotide lengths, as designated by color) from their original 10-nucleotide length, as a function of the full-length UMI estimate (x-axis; see “Methods”). The line shows spline-smoothed dependency with the 95% confidence band. Points show median y value for a given x. The errors result from two opposing trends, with UMI sequencing errors inflating the resulting count estimates, and UMI collisions deflating the estimates. Shortened UMIs result in a larger number of collisions. b The effect of different UMI collision corrections is shown on the 6-nucleotide trimmed UMIs. c Comparison of different UMI sequence error correction methods is shown for the 8-nucleotide trimmed UMIs. UMI collisions were corrected using an empirical approach in all cases except for “no correction”. d We estimated theoretical distribution of edit distances (x-axis) between two randomly sampled UMIs. The theoretical probability of observing a given edit distance is shown as a number above each edit distance group. The histograms show relative absolute difference between this theoretical distribution and observed distributions after the different UMI correction algorithms. For each method and edit distance, the y-axis shows the absolute difference between the observed and theoretical distribution, expressed as a fraction of the theoretical probability of observing that edit distance. e Dependency of the magnitude of UMI correction (y-axis) on the expression magnitude without correction (x-axis) is shown. Each point represents a single gene within a cell, pooled across all cells. Genes with expression magnitude < 10 were omitted.
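The comparison in panel d can be sketched under a simplifying assumption: if UMIs are uniform and only substitutions are counted (Hamming distance, swapped in here for the edit distance named in the caption), the distance between two random UMIs of length L follows a Binomial(L, 3/4) distribution. The relative absolute difference then follows the caption's definition. The function names are hypothetical, and the paper's own theoretical distribution may be derived differently.

```python
from math import comb


def hamming_distance_pmf(umi_length, mismatch_prob=0.75):
    """P(distance = d) for two independent uniform random UMIs, d = 0..umi_length.

    Assumption: each position mismatches independently with probability 3/4.
    """
    match_prob = 1.0 - mismatch_prob
    return [
        comb(umi_length, d) * mismatch_prob ** d * match_prob ** (umi_length - d)
        for d in range(umi_length + 1)
    ]


def relative_abs_difference(observed, theoretical):
    """|observed - theoretical| as a fraction of the theoretical probability."""
    return [abs(o - t) / t for o, t in zip(observed, theoretical)]


if __name__ == "__main__":
    theoretical = hamming_distance_pmf(8)
    # Hypothetical observed distribution with an excess of distance-1 pairs,
    # as might be left behind by uncorrected sequencing errors.
    observed = list(theoretical)
    observed[1] += 0.001
    observed[8] -= 0.001
    for d, diff in enumerate(relative_abs_difference(observed, theoretical)):
        print(d, diff)
```

Because small distances have a very low theoretical probability, even a small excess of near-identical UMI pairs produces a large relative difference, which is why the normalization by the theoretical probability is informative.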
Fig. 3
Correcting for cellular barcode errors. a The number of molecules mapping to human and mouse genomes in a human–mouse Drop-seq dataset (dataset 12) is shown for each cell (points) on a log scale. The plot shows annotations of high-confidence cells for each organism, doublets, and background barcodes. b The number of equidistant adjacent CBs of larger size (i.e., number of molecules) is shown for each of the observed CBs in the mouse embryonic stem cell dataset (dataset 1). The main plot shows adjacent CBs selected from an a priori known set of valid CB sequences. The inset shows counts of adjacent CBs selected from all CB sequences observed in the dataset. c To illustrate the effect of CB corrections, the plot shows the increase in number of molecules per CB (x-axis) following a CB merge correction procedure, relative to the original size. The 10x 8k PBMC (dataset 13), Drop-seq human–mouse mixture (dataset 12), and inDrop BMC (dataset 11) datasets are shown.
Fig. 4
Selection of the optimal size threshold for the 10x BMMC dataset. These plots show comparison of dropEst and 10x Cell Ranger strategies for initial selection of number of real cells in the 10x BMMC dataset (dataset 8). a The distribution of molecular mass across CBs of different sizes. The y-axis shows the number of UMIs per cell multiplied by the number of cells with a similar number of UMIs. The cells are ranked by their size (number of UMIs), with the largest cells positioned near 0 (see “Methods”). Such “molecular mass” plots can be used to estimate the number of real cells in a dataset. Here, the peak centered around x = 1200 represents real cells. The vertical dashed lines show size-based thresholds, as determined by Cell Ranger (red) and dropEst (green). dropEst threshold admits 1105 additional cells. b The heatmap shows gene expression profiles of cluster-specific genes for the cells that were admitted by both 10x and dropEst thresholds. Expression levels of different genes (columns) are shown by color. Cells (rows) are grouped by cluster (see cluster bar on the right), and then ordered descending by number of molecules (the depth bar on the right). Genes (columns) were clustered using hierarchical clustering. See “Methods” for details. c Similar to b, the heatmap shows expression of the same genes in the set of an additional 1105 cells admitted by the dropEst threshold procedure. The additional cells show expression patterns consistent with their assigned clusters. d t-SNE visualization of the 10x BMMC dataset. All cells which pass both Cell Ranger and dropEst thresholds are shown as circles. Cells which were admitted only with the dropEst threshold are shown as triangles.
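One way to read the "molecular mass" quantity in panel a is as a histogram of cell sizes in which every cell is weighted by its own UMI count, so each bin holds roughly (UMIs per cell) × (number of cells of similar size). The sketch below follows that reading. It bins by log10 size rather than by cell rank as in the figure, and the function name and bin count are assumptions, not dropEst's implementation.

```python
import numpy as np


def molecular_mass(umis_per_cell, n_bins=50):
    """Total UMIs contributed by cells of similar size.

    Returns (mass, edges): mass[i] is the sum of UMI counts over cells whose
    log10 size falls into the bin [edges[i], edges[i + 1]).
    """
    sizes = np.asarray(umis_per_cell, dtype=float)
    sizes = sizes[sizes > 0]  # barcodes without molecules carry no mass
    mass, edges = np.histogram(np.log10(sizes), bins=n_bins, weights=sizes)
    return mass, edges


if __name__ == "__main__":
    # Synthetic example: many tiny background barcodes plus a group of real cells.
    rng = np.random.default_rng(0)
    background = rng.integers(1, 20, size=20000)
    real_cells = rng.integers(2000, 8000, size=1000)
    mass, edges = molecular_mass(np.concatenate([background, real_cells]))
    peak = mass.argmax()
    print("peak bin spans sizes", 10 ** edges[peak], "to", 10 ** edges[peak + 1])
```

In such a plot, real cells show up as a separate peak even when background barcodes are far more numerous, because the weighting favors barcodes that carry many molecules.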
Fig. 5
Filtration of low-quality cells for the 10x 8k PBMC dataset. This figure shows the result of the KDE-based algorithm for the filtration of low-quality cells on the 10x 8k PBMC dataset (dataset 13). a t-SNE visualization of the cell subpopulations; only cells which either passed the size threshold or have a quality score > 0.9 are shown. Cells passing the dropEst size threshold and having a quality score ≥ 0.1 are shown with circles. A few cells falling below the size threshold but with a high (> 0.9) quality score are shown with triangles. Cells passing the size threshold but with a low (< 0.1) quality score are considered as filtered and are shown with black crosses. Most filtered cells originated from three distinct clusters, marked by a high fraction of intergenic or mitochondrial reads and a low number of reads per UMI (see labels). b–d Distributions of distinguishing characteristics (x-axes) are compared between clusters of low quality cells and the real cell population. Here, we consider a cell to be real if it passes the size threshold and has a quality score > 0.9.
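The display rules stated in panel a can be written as a small decision function. This is only a sketch of the caption's thresholds (0.1 and 0.9). The function name and the category labels are hypothetical, and the KDE-based computation of the quality score itself is not reproduced here.

```python
def classify_cell(passes_size_threshold, quality_score):
    """Map a cell to the plotting category described in the Fig. 5a caption."""
    if passes_size_threshold:
        # Above the size threshold: low-scoring cells are filtered (black crosses),
        # the rest are kept (circles).
        return "filtered" if quality_score < 0.1 else "kept"
    # Below the size threshold: only high-scoring cells are shown (triangles).
    return "rescued" if quality_score > 0.9 else "not_shown"


if __name__ == "__main__":
    examples = [(True, 0.95), (True, 0.05), (False, 0.97), (False, 0.5)]
    for passes, score in examples:
        print(passes, score, classify_cell(passes, score))
```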
Fig. 6
Filtration of low-quality cells for the inDrop mouse BMC dataset. This figure shows the result of the KDE-based algorithm for filtration of low-quality cells in the inDrop mouse BMC dataset (dataset 11). a, b Similar to Fig. 4b, c, the heatmap shows expression of cluster-specific genes in cells with high quality scores (> 0.9) that were identified above the size-based threshold (a), and “rescued” below the size-based threshold (b). c t-SNE visualization of the dataset, similar to Fig. 5a.
