dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments

Genome Biol. 2018 Jun 19;19(1):78. doi: 10.1186/s13059-018-1449-6.

Viktor Petukhov^{1,2}, Jimin Guo², Ninib Baryawno^{3,4,5}, Nicolas Severe^{3,4,5}, David T Scadden^{3,4,5}, Maria G Samsonova¹, Peter V Kharchenko^{6,7}

Affiliations

¹ Department of Applied Mathematics, Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia.
² Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
³ Center for Regenerative Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁴ Harvard Stem Cell Institute, Cambridge, MA, USA.
⁵ Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA.
⁶ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. peter.kharchenko@post.harvard.edu.
⁷ Harvard Stem Cell Institute, Cambridge, MA, USA. peter.kharchenko@post.harvard.edu.

PMID: 29921301
PMCID: PMC6010209
DOI: 10.1186/s13059-018-1449-6

Viktor Petukhov et al. Genome Biol. 2018.


Abstract

Recent single-cell RNA-seq protocols based on droplet microfluidics use massively multiplexed barcoding to enable simultaneous measurements of transcriptomes for thousands of individual cells. The increasing complexity of such data creates challenges for subsequent computational processing and troubleshooting of these experiments, with few software options currently available. Here, we describe a flexible pipeline for processing droplet-based transcriptome data that implements barcode corrections, classification of cell quality, and diagnostic information about the droplet libraries. We introduce advanced methods for correcting composition bias and sequencing errors affecting cellular and molecular barcodes to provide more accurate estimates of molecular counts in individual cells.

Conflict of interest statement

Ethics approval

The animal work conducted for this study was approved by the Institutional Animal Care and Use Committee (IACUC) of Massachusetts General Hospital.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Skewed distribution of UMIs leads to increased number of UMI collisions. **a** Distribution of UMI occurrence frequencies across all genes is shown for mouse embryonic stem (ES) cells (dataset 1). The *top-right inset* shows position-specific nucleotide frequencies of the outlier UMIs (highlighted by *gray shading* on the main plot). Significant skewness of the UMI distribution decreases the effective pool of UMIs. **b** Proportions of different nucleotides in the UMI sequences are shown as a function of the overall UMI frequencies (x-axis orders UMIs so that most frequently occurring UMI sequences have low rank) for the mouse ES cells (dataset 1). **c** Estimated number of UMI collisions as a function of the true gene expression level (x-axis) is shown for different UMI lengths (simulated by trimming 10-nucleotide UMIs; see text). The estimates based on the uniform and empirical UMI distributions are shown. The 10x Chromium human post-transplant BMMC dataset (dataset 7) was used. For short UMIs, the number of collisions observed at highly expressed genes can be comparable to the true number of molecules. Longer UMIs decrease the number of collisions.
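
The relationship in panel c can be illustrated with a minimal sketch. It assumes each molecule of a gene draws its UMI independently from a fixed distribution; the function names and the Dirichlet-based skewed distribution are hypothetical stand-ins for illustration, not the empirical distribution or the estimator used in the paper.

```python
import numpy as np


def expected_collisions(n_molecules, umi_probs):
    """Expected number of molecules hidden by UMI collisions for one gene in one cell.

    Each molecule is assumed to draw its UMI independently with probability
    umi_probs[i]. UMI i is then observed at least once with probability
    1 - (1 - p_i)^n, so the expected number of distinct UMIs is the sum of
    these terms, and the remaining molecules are lost to collisions.
    """
    p = np.asarray(umi_probs, dtype=float)
    expected_distinct = np.sum(1.0 - (1.0 - p) ** n_molecules)
    return n_molecules - expected_distinct


def uniform_umi_probs(umi_length):
    """Uniform distribution over all 4^L possible UMI sequences."""
    n_umis = 4 ** umi_length
    return np.full(n_umis, 1.0 / n_umis)


def skewed_umi_probs(umi_length, seed=0):
    """Hypothetical skewed distribution (Dirichlet weights), for illustration only."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.full(4 ** umi_length, 0.5))


if __name__ == "__main__":
    for length in (6, 8):
        uniform = expected_collisions(1000, uniform_umi_probs(length))
        skewed = expected_collisions(1000, skewed_umi_probs(length))
        print(f"UMI length {length}: uniform {uniform:.1f}, skewed {skewed:.1f}")
```

Under this model, any departure from the uniform distribution can only increase the expected number of collisions, and a longer UMI lowers it, which matches the qualitative trends described in the caption.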

**Fig. 2**
Comparison of UMI collision and sequencing error correction methods. Comparison of UMI collision adjustment and UMI correction algorithms is shown using the 10x post-transplant BMMC dataset (dataset 7). **a** The scatter plot shows percentage error (y-axis) in estimation of the molecular counts for different genes using computationally trimmed UMIs (down to 6–9-nucleotide lengths, as designated by color) from their original 10-nucleotide length, as a function of the full-length UMI estimate (x-axis; see “Methods”). The *line* shows spline-smoothed dependency with the 95% confidence band. *Points* show median y value for a given x. The errors result from two opposing trends, with UMI sequencing errors inflating the resulting count estimates, and UMI collisions deflating the estimates. Shortened UMIs result in a larger number of collisions. **b** The effect of different UMI collision corrections is shown on the 6-nucleotide trimmed UMIs. **c** Comparison of different UMI sequence error correction methods is shown for the 8-nucleotide trimmed UMIs. UMI collisions were corrected using an *empirical* approach in all cases except for “no correction”. **d** We estimated the theoretical distribution of edit distances (x-axis) between two randomly sampled UMIs. The theoretical probability of observing a given edit distance is shown as a number above each edit distance group. The histograms show relative absolute difference between this theoretical distribution and observed distributions after the different UMI correction algorithms. For each method and edit distance, the y-axis shows the absolute difference between the observed and theoretical distribution, expressed as a fraction of the theoretical probability of observing that edit distance. **e** Dependency of the magnitude of UMI correction (y-axis) on the expression magnitude without correction (x-axis) is shown. Each point represents a single gene within a cell, pooled across all cells. Genes with expression magnitude < 10 were omitted.
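
The quantity plotted in panel d can be sketched as follows. This example assumes that "edit distance" means Hamming distance and that the two UMIs are drawn with independent, uniformly distributed nucleotides; the paper may use a different distance or the empirical UMI distribution, so the function names and the model are illustrative assumptions only.

```python
from math import comb


def hamming_distance_pmf(umi_length):
    """P(distance = k) for two random UMIs of the given length.

    With uniform, independent nucleotides, each position differs with
    probability 3/4, so the distance is Binomial(umi_length, 3/4).
    """
    return [
        comb(umi_length, k) * 0.75 ** k * 0.25 ** (umi_length - k)
        for k in range(umi_length + 1)
    ]


def relative_abs_difference(observed, theoretical):
    """|observed - theoretical| as a fraction of the theoretical probability."""
    return [abs(o - t) / t for o, t in zip(observed, theoretical)]


if __name__ == "__main__":
    theoretical = hamming_distance_pmf(10)
    for distance, prob in enumerate(theoretical):
        print(f"distance {distance}: {prob:.6f}")
```

A correction method that leaves too many near-identical UMIs unmerged would show an excess at small distances, which this relative difference makes visible even though the theoretical probabilities there are tiny.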

**Fig. 3**
Correcting for cellular barcode errors. **a** The number of molecules mapping to human and mouse genomes in a human–mouse Drop-seq dataset (dataset 12) is shown for each cell (*points*) on a log scale. The plot shows annotations of high-confidence cells for each organism, doublets, and background barcodes. **b** The number of equidistant adjacent CBs of larger size (i.e., number of molecules) is shown for each of the observed CBs in the mouse embryonic stem cell dataset (dataset 1). The main plot shows adjacent CBs selected from an a priori known set of valid CB sequences. The *inset* shows counts of adjacent CBs selected from all CB sequences observed in the dataset. **c** To illustrate the effect of CB corrections, the plot shows the increase in number of molecules per CB (x-axis) following a CB merge correction procedure, relative to the original size. The 10x 8k PBMC (dataset 13), Drop-seq human–mouse mixture (dataset 12), and inDrop BMC (dataset 11) datasets are shown.
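
The count shown in panel b can be approximated with a short sketch. It assumes that "adjacent" means Hamming distance 1 and that CB size is the number of molecules; both the distance choice and the function names are assumptions made for illustration and do not reproduce the dropEst implementation.

```python
NUCLEOTIDES = "ACGT"


def neighbors_at_distance_one(barcode):
    """Yield every sequence that differs from the barcode at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in NUCLEOTIDES:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def count_larger_adjacent(cb_sizes):
    """For each CB, count the adjacent CBs that contain more molecules.

    cb_sizes maps a cell barcode sequence to its number of molecules.
    A count above 1 means several equally close merge targets exist,
    so a simple "merge into the nearest larger CB" rule is ambiguous.
    """
    return {
        cb: sum(1 for nb in neighbors_at_distance_one(cb) if cb_sizes.get(nb, 0) > size)
        for cb, size in cb_sizes.items()
    }


if __name__ == "__main__":
    sizes = {"AAAA": 100, "AAAC": 50, "AAAT": 5, "AATA": 80, "CCCC": 3}
    print(count_larger_adjacent(sizes))
```

Restricting the candidate neighbors to a known list of valid barcodes, as in the main plot, would amount to filtering `cb_sizes` to that list before counting.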

**Fig. 4**
Selection of the optimal size threshold for the 10x BMMC dataset. These plots show comparison of dropEst and 10x Cell Ranger strategies for initial selection of number of real cells in the 10x BMMC dataset (dataset 8). **a** The distribution of molecular mass across CBs of different sizes. The y-axis shows the number of UMIs per cell multiplied by the number of cells with a similar number of UMIs. The cells are ranked by their size (number of UMIs), with the largest cells positioned near 0 (see “Methods”). Such “molecular mass” plots can be used to estimate the number of real cells in a dataset. Here, the peak centered around x = 1200 represents real cells. The *vertical dashed lines* show size-based thresholds, as determined by Cell Ranger (*red*) and dropEst (*green*). The dropEst threshold admits 1105 additional cells. **b** The heatmap shows gene expression profiles of cluster-specific genes for the cells that were admitted by both 10x and dropEst thresholds. Expression levels of different genes (columns) are shown by color. Cells (rows) are grouped by cluster (see *cluster bar* on the *right*), and then ordered descending by number of molecules (the *depth bar* on the *right*). Genes (columns) were clustered using hierarchical clustering. See “Methods” for details. **c** Similar to b, the heatmap shows expression of the same genes in the set of an additional 1105 cells admitted by the dropEst threshold procedure. The additional cells show expression patterns consistent with their assigned clusters. **d** t-SNE visualization of the 10x BMMC dataset. All cells which pass both Cell Ranger and dropEst thresholds are shown as *circles*. Cells which were admitted only with the dropEst threshold are shown as *triangles*.
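
The "molecular mass" curve in panel a can be sketched as below. The binning scheme (equal-sized groups of consecutive ranks) and the function name are assumptions for illustration; the paper's "Methods" may define the bins differently.

```python
import numpy as np


def molecular_mass_by_rank(umis_per_cb, n_bins):
    """Molecular mass per rank bin.

    CBs are ranked by size in descending order, so the largest cells sit
    near rank 0. Within each bin of consecutive ranks, the mass is the
    typical number of UMIs per cell multiplied by the number of cells in
    the bin, which equals the total number of UMIs held by that bin.
    """
    sizes = np.sort(np.asarray(umis_per_cb, dtype=float))[::-1]
    chunks = np.array_split(sizes, n_bins)
    return np.array([chunk.mean() * chunk.size if chunk.size else 0.0 for chunk in chunks])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical mixture: 1,000 real cells and 50,000 small background barcodes.
    real_cells = rng.poisson(3000, size=1000)
    background = rng.poisson(20, size=50000)
    mass = molecular_mass_by_rank(np.concatenate([real_cells, background]), n_bins=51)
    print(mass[:3])
```

On such a curve, real cells appear as a peak at low ranks because they are few but large, while background barcodes contribute mass only through their sheer number.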

**Fig. 5**
Filtration of low-quality cells for the 10x 8k PBMC dataset. This figure shows the result of the KDE-based algorithm for the filtration of low-quality cells on the 10x 8k PBMC dataset (dataset 13). **a** t-SNE visualization of the cell subpopulations; only cells which either passed the size threshold or have a quality score > 0.9 are shown. Cells passing the dropEst size threshold and having a quality score ≥ 0.1 are shown with *circles*. A few cells falling below the size threshold but with a high (> 0.9) quality score are shown with *triangles*. Cells passing the size threshold but with a low (< 0.1) quality score are considered as filtered and are shown with *black crosses*. Most filtered cells originated from three distinct clusters, marked by a high fraction of intergenic or mitochondrial reads and a low number of reads per UMI (see labels). **b–d** Distributions of distinguishing characteristics (x-axes) are compared between clusters of low-quality cells and the real cell population. Here, we consider a cell to be real if it passes the size threshold and has a quality score > 0.9.
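
The marker assignment in panel a amounts to a small decision rule, sketched here. The cutoffs 0.1 and 0.9 come from the caption; the function name and the category labels (in particular "background" for cells that are not plotted) are hypothetical, and the quality score itself is taken as a given input rather than computed.

```python
def classify_cell(passes_size_threshold, quality_score):
    """Assign a plotting category from the size threshold and the quality score.

    "kept"       passes the size threshold, score >= 0.1 (circles)
    "filtered"   passes the size threshold, score < 0.1 (black crosses)
    "rescued"    below the size threshold, score > 0.9 (triangles)
    "background" below the size threshold, score <= 0.9 (not shown)
    """
    if passes_size_threshold:
        return "filtered" if quality_score < 0.1 else "kept"
    return "rescued" if quality_score > 0.9 else "background"


if __name__ == "__main__":
    examples = [(True, 0.95), (True, 0.05), (False, 0.97), (False, 0.4)]
    for passes, score in examples:
        print(passes, score, classify_cell(passes, score))
```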

**Fig. 6**
Filtration of low-quality cells for the inDrop mouse BMC dataset. This figure shows the result of the KDE-based algorithm for filtration of low-quality cells in the inDrop mouse BMC dataset (dataset 11). **a, b** Similar to Fig. 4b, c, the heatmap shows expression of cluster-specific genes in cells with high quality scores (> 0.9) that were identified above the size-based threshold (a), and “rescued” below the size-based threshold (b). **c** t-SNE visualization of the dataset, similar to Fig. 5a.

Grants and funding

R01HL131768/HL/NHLBI NIH HHS/United States
