Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 May 26;17(1):112.
doi: 10.1186/s13059-016-0970-8.

Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts

Affiliations

Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts

Vasilis Ntranos et al. Genome Biol. .

Abstract

Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
Equivalence class and transcript-compatibility counts. This figure gives an example of how reads are collapsed into equivalence classes. Each read is mapped to one or more transcripts in the reference transcriptome; these are transcripts that the read is compatible with, i.e., the transcripts that the read could possibly have come from. For example, read 1 is compatible with transcripts t1 and t3, read 2 is compatible with transcripts t1 and t2, and so on. An equivalence class is a group of reads that is compatible with the same set of transcripts. For example, reads 4,5,6,7,8 are all compatible with t1, t2, and t3, and they form an equivalence class. Since the reads in an equivalence class are all compatible with the same set of transcripts, we simply represent an equivalence class by that set of transcripts. For example, the equivalence class consisting of reads 4,5,6,7,8 is represented by {t1,t2,t3}. Aggregating the number of reads in each equivalence class yields the corresponding transcript-compatibility counts. Note that in order to estimate the transcript abundances from the transcript-compatibility counts, a read-generation model is needed to resolve the multi-mapped reads
Fig. 2
Fig. 2
Overview of the method. This figure illustrates our transcript-compatibility count (TCC) clustering method in a very simple, yet instructive example and highlights its major differences with respect to the conventional single-cell clustering approach. Here, we consider an scRNA-seq example with K cells (only the reads coming from Cell1 and Cell2 are shown here) and a reference transcriptome consisting of three transcripts, t 1, t 2, and t 3. Conventional approach: Single cells are clustered based on their transcript or gene abundances (here we only focus on transcripts for concreteness). This widely adopted pipeline involves computing a (#transcripts × #cells) expression matrix by first aligning each cell’s reads to the reference. The corresponding alignment information is next to each read, which for the purpose of illustration only contains the mapped positions (the aligned reads of Cell1 are also annotated directly on the transcripts). While reads 1 and 5 are uniquely mapped to transcripts 1 and 3, reads 2, 3, and 4 are mapped to multiple transcripts (multi-mapped reads). The quantification step must therefore take into account a specific read-generating model and handle multi-mapped reads accordingly. Our proposed method: Single cells are clustered based on their transcript-compatibility counts. Our method assigns the reads of each cell to equivalence classes via the process of pseudoalignment and simply counts the number of reads that fall in each class to construct a (# eq.classes × #cells) matrix of transcript-compatibility counts. Then, the method proceeds by directly using the transcript-compatibility counts for downstream processing and single-cell clustering. The underlying idea here is that even though equivalence classes may not have an explicit biological interpretation, their read counts can collectively provide us with a distinct signature of each cell’s gene expression; transcript-compatibility counts can be thought of as feature vectors, and cells can be identified by their differential expression over these features. Compared to the conventional approach, our method does not attempt to resolve multi-mapped reads (no need for an assay-specific read-generating model) and only requires transcript compatibility information for each read (no need for exact read alignment)
Fig. 3
Fig. 3
Runtime comparison of alignment methods. a The time required to process 3005 cells from mouse brain cell dataset [7] in core hours is shown here. The time taken for read alignment with Bowtie and HISAT is much larger than the time taken for kallisto pseudoalignment (which is used by our method to obtain the transcript-compatibility counts). kallisto pseudoalignment and HISAT were run on 32 cores. Bowtie and Word Count were each timed on 1 core with 10 randomly selected cells. The bars shown here are estimates obtained by multiplying these times by 300.5. Because we do not account for the overhead associated with parallelizing, the Bowtie and Word Count estimates are lower bounds on their runtimes in practice. After preprocessing the UMIs, each of the 5,914,602,849 single-end reads in the dataset were less than 50 bp long. b The time required to process 271 cells of the dataset of [12] in core hours is shown here. As before, the time taken for read alignment with HISAT is significantly larger than the time taken for kallisto pseudoalignment. Both methods were run on 32 cores. The dataset has 814,344,693 paired-end reads, and each mate in a pair is 100 bp long
Fig. 4
Fig. 4
Temporal ordering of differentiating primary human myoblasts using transcript-compatibility counts. a A minimum spanning tree (MST) was drawn through the 271 cells using Jensen-Shannon distances computed between 1,101,805-dimensional vectors of TCCs of cells in the dataset. Following the longest path does not show a clear cell differentiation pattern. b Affinity propagation clustering generated seven clusters (after collapsing spurious clusters with less than five cells into their nearest neighboring cluster), and an MST was drawn through the centroids of the clusters. Using the labels from Trapnell et al. [12], the longest path shows a differentiation pattern from proliferating cell (red) to differentiating myoblast (blue). The MST also shows how some proliferating cells alternatively differentiate into interstitial mesenchymal cells (green). c The cells were then clustered into three groups based on their transcript-compatibility counts, and the MST from b was relabeled using these new cell types. d The expression levels of the genes MYOG, CDK1, and PDGFRA were analyzed for the three TCC clusters. MYOG, CDK1, and PDGFRA show greater expression for centroids from clusters with greater proportions of TCC cell types 1, 2, and 3, respectively. For each gene, a histogram over each centroid shows how expression level evolves with the differentiation process. CDK1, MYOG, and PDGFRA being markers for proliferating cells, differentiating myoblasts, and interstitial mesenchymal cells indicates that the clustering and centroid-ordering based on TCC captures intermediate steps of the human myoblast differentiation trajectory
Fig. 5
Fig. 5
Clustering primary human myoblasts based on transcript-compatibility counts. a The transcript-compatibility counts matrix for 271 primary human myoblasts from [12] is visualized using a diffusion map. Three clusters obtained using affinity propagation are shown along with the distribution of these cells across the four cell-collection timepoints (0, 24, 48, and 72 hours). b The diffusion map obtained using transcript compatibility counts is relabeled using the cells reported by [12]. Clusters 1,2,3 generated by the transcript compatibility based method map to proliferating cells, differentiating myoblasts, and interstitial cells, respectively. According to Trapnell et al.’s labels, the transcript compatibility based method seems to have severely misclassified cell T48_CT_G10 (SRR1033183) as a differentiating myoblast. c Comparing the expressions of 12 differentiating genes in T48_CT_G10 with those of the average proliferating cell and the average differentiating myoblast, 8 out of the 12 genes show expressions similar to what one would expect from a differentiating myoblast. MYOG seems to show an FPKM of 14, which while more than the mean expression of proliferating cells (around 0.28) is much less than the mean expression of differentiating myoblasts (around 61.33). We note that this cell has the highest expression of MYOG among all cells labeled by Trapnell et al. as proliferating cell (and the second highest cell has expression around 5.4). However there are 88 differentiating myoblasts with MYOG expression less than 15 FPKM. Hence it is reasonable to think that this MYOG expression is more typical of differentiating myoblasts than proliferating cells. Only genes CDK1 and CCMB2 show expressions close to what one would expect from a proliferating cell. Even though CDK1 is a highly specific marker for proliferating cells, the above gene profile indicates that classifying cell T48_CT_G10 as a differentiating myoblast seems reasonable
Fig. 6
Fig. 6
Clustering mouse brain cells based on transcript-compatibility counts. a The TCC distribution and gene expression matrices for the 3005 mouse brain cells are visualized using t-SNE (based on Jensen-Shannon distances between TCC distributions and gene expression distributions of cells, respectively) and colored with the cell type determined by Zeisel et al. [7]. We note that the transcript-compatibility based t-SNE also visually maintains the cluster structure of the nine major clusters, even though it can be computed two orders of magnitude faster than the gene expression matrix. b Cells from each of two cell types determined by Zeisel et al. were randomly selected, and then the clustering accuracy of multiple methods was tested. The clustering accuracy was measured as the error rate of the clustering. First, we note that the 3’-end bias in this dataset significantly affects the accuracy of kallisto and eXpress that have been chosen here as representative methods for model-based quantification (see Methods). For both methods clustering was performed on gene expression profiles obtained by summing the corresponding transcript abundances. For each point in the eXpress and kallisto curves, we took the minimum of the error rates obtained with bias modeling turned on and off. By avoiding estimation of the read model, transcript-compatibility based methods were indeed more accurate. We see that transcript-compatibility based clustering achieves similar accuracy to the gene-level UMI counting method implemented by the authors for this dataset without explicitly accounting for PCR biases. Refining transcript-compatibility counting to correct for PCR biases (by counting only the distinct UMIs of reads in each equivalence class) leads to a marginal improvement of our method. c Running affinity propagation on the TCC distribution matrix (using negative Jensen-Shannon distance as similarity metric) produced a cluster of 28 cells, 24 of which were labeled Oligo1. Zeisel et al. [7] classified 45 of the 3005 cells as this new class of cells. The bar plot compares the mean expression of selected oligodendrocyte marker genes in the TCC cluster to their mean expression in Zeisel et al.’s Oligo1. As reported in [7], Oligo1 cells are characterized by their distinct expression of genes such as Itpr2, Rnf122, Idh1, and Gpr17. The similarity of the bars seems to suggest that clustering on TCC can capture this fine-grained information. Note that although single-cell clustering was entirely performed based on transcript-compatibility counts, the gene expression data used to evaluate this figure were obtained from Zeisel et al
Fig. 7
Fig. 7
Analysis on transcript-compatibility counts refines the classification of mouse brain cells. a Runs of affinity propagation with different propagation and damping parameters were carried out on the TCC matrix for the 3005 mouse brain cells of [7]. The Oligo3 subclass discovered by Zeisel et al. was consistently split into two subclasses A and B. b Cells in the Oligo3 B class showed greater expression of endothelial/vascular genes and lower expression of myelinating oligodendrocyte genes. The opposite was true for the cells in Oligo3 A. This result may corroborate the potential contamination of oligodendrocytes in the Zeisel et al. dataset that has recently been reported in Fan et al. [37]
Fig. 8
Fig. 8
Coverage requirements for clustering based on transcript-compatibility counts. As an intermediate between raw reads and quantified transcript abundances, transcript-compatibility counts intuitively have more information than the reconstructed transcripts and less noise than the raw reads. a As the read coverage of a cell of the dataset of [7] decreases from approximately 627k mapped reads, different methods have varying robustness to the loss of coverage. Each method was evaluated on its ability to cluster 200 randomly selected neurons mixed with 200 randomly selected non-neurons into the two cell types (the clustering of [7] being considered as the ground truth). Among methods which do not explicitly account for PCR bias, TCC based clustering performed much better than kallisto and eXpress and was quite close in performance to the UMI counting method of [7]. For both kallisto and eXpress, clustering was performed on gene expression profiles obtained by summing the corresponding transcript abundances. For each point in the eXpress and kallisto curves, we took the minimum of the error rates obtained with bias modeling turned on and off. By counting the number of unique UMIs rather than reads (TCC with UMI in the plot), transcript-compatibility based clustering was adapted to account for PCR bias, resulting in similar performance to that of gene-level UMI counting used in [7]. b Even at significantly decreased coverage depths, our method maintains clusters corresponding to the nine major cell types identified by Zeisel et al. The transcript-compatibility distribution matrices at varying coverage depths are visualized using t-SNE. c At various coverage depths, transcript-compatibility counting with UMIs disagrees slightly with the cells the authors labeled as Oligo1 (cyan cell IDs). As the coverage decreases, transcript-compatibility based affinity propagation still identifies a cluster that captures the vast majority of Oligo1 cells in the 3005-cell population

References

    1. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, Louis DN, Rozenblatt-Rosen O, Suvà ML, Regev A, Bernstein BE. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344(6190):1396–1401. doi: 10.1126/science.1254257. - DOI - PMC - PubMed
    1. Pollen AA, Nowakowski TJ, Chen J, Retallack H, Sandoval-Espinosa C, Nicholas CR, Shuga J, Liu SJ, Oldham MC, Diaz A, Lim DA, Leyrat AA, West JA, Kriegstein AR. Molecular identity of human outer radial glia during cortical development. Cell; 163(1):55–67. doi:10.1016/j.cell.2015.09.004. - DOI - PMC - PubMed
    1. Gaublomme JT, Yosef N, Lee Y, Gertner RS, Yang LV, Wu C, Pandolfi PP, Mak T, Satija R, Shalek AK, Kuchroo VK, Park H, Regev A. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell; 163(6):1400–12. doi:10.1016/j.cell.2015.11.009. - DOI - PMC - PubMed
    1. Kowalczyk MS, Tirosh I, Heckl D, Rao TN, Dixit A, Haas BJ, Schneider RK, Wagers AJ, Ebert BL, Regev A. Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 2015;25(12):1860–1872. doi: 10.1101/gr.192237.115. - DOI - PMC - PubMed
    1. Lande-Diner L, Stewart-Ornstein J, Weitz CJ, Lahav G. Single-cell analysis of circadian dynamics in tissue explants. Mol Biol Cell. 2015;26(22):3940–945. doi: 10.1091/mbc.E15-06-0403. - DOI - PMC - PubMed

Publication types

LinkOut - more resources