Comput Struct Biotechnol J. 2022 Oct 26:20:6375-6387.
doi: 10.1016/j.csbj.2022.10.029. eCollection 2022.

Evaluation of single-cell RNA-seq clustering algorithms on cancer tumor datasets

Alaina Mahalanabis et al. Comput Struct Biotechnol J.

Abstract

Tumors are complex biological entities that comprise cell types of different origins, with different mutational profiles and different patterns of transcriptional dysregulation. The exploration of data related to cancer biology requires careful analytical methods to reflect the heterogeneity of cell populations in cancer samples. Single-cell techniques are now able to capture the transcriptional profiles of individual cells. However, the complexity of RNA-seq data, especially in cancer samples, makes it challenging to cluster single-cell profiles into groups that reflect the underlying cell types. We have developed a framework for a systematic examination of single-cell RNA-seq clustering algorithms for cancer data, which uses a range of well-established metrics to generate a unified quality score and algorithm ranking. To demonstrate this framework, we examined clustering performance of 15 different single-cell RNA-seq clustering algorithms on eight different cancer datasets. Our results suggest that the single-cell RNA-seq clustering algorithms fall into distinct groups by performance, with the highest clustering quality on non-malignant cells achieved by three algorithms: Seurat, bigSCale and Cell Ranger. However, for malignant cells, two additional algorithms often reach a better performance, namely Monocle and SC3. Their ability to detect known rare cell types was also among the best, along with Seurat. Our approach and results can be used by a broad audience of practitioners who analyze single-cell transcriptomic data in cancer research.

Keywords: Automated algorithms; Cancer; Clustering; Framework; Single-Cell RNA-seq.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1
The main analysis workflow consisted of four stages. First, the clustering algorithms were applied to the eight cancer datasets to generate clustering partitions (blue). Then seven different metrics of clustering quality were examined and grouped into three distinct groups by similarity (yellow). By combining three representative measures, one per group, we generated quality scores first for each clustering partition and then for each algorithm (green). Finally, we ranked the algorithms by quality scores for each choice of measures, and then combined these ranks into a final ranking (pink). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 2
Clustering quality was assessed using seven different measures for each pair of algorithm and dataset: AMI, ARI, F-measure, homogeneity, majority, silhouette and VI distance. Principal component (PC) analysis with feature scaling was then performed on the collection of 102 clustering partitions in the space defined by the quality measures. The heatmap shows the absolute Pearson correlation among the seven measures and the top three PCs. The three PCs collectively explain over 90% of the variance in the measurement data, as indicated in their labels. Four measures (AMI, ARI, F-measure and VI) correlate best with PC1, which captures 61% of the variation. Two other measures, homogeneity and majority, are also highly correlated with each other and are best reflected by PC2. The remaining measure, silhouette, is represented by PC3.
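Two of the seven measures can be illustrated with a minimal sketch in plain Python. The functions below are illustrative and are not the authors' implementation. In particular, `majority_score` assumes that "majority" means the fraction of cells that carry the most common reference label of their cluster (often called purity), which this page does not confirm.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between reference labels and a clustering partition."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs of cells to compare
    # Count pairs of cells that are grouped together in both, or in each, labeling.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put all cells together
    return (pairs_joint - expected) / (maximum - expected)


def majority_score(truth, predicted):
    """Assumed definition: share of cells matching their cluster's dominant label."""
    by_cluster = {}
    for t, p in zip(truth, predicted):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)
```

A clustering that reproduces the reference labels up to renaming scores 1.0 on both measures, while placing all cells of two equally sized types in a single cluster gives an ARI of 0.0 and a majority score of 0.5.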
Fig. 3
Example of the ranking of the 15 clustering algorithms based on eight different combinations of the three metrics used in the generation of the summary quality score. For each dataset-algorithm pair, the three representative measures (e.g. AMI, homogeneity and silhouette) were converted into quantile values based on the three respective data distributions. Then, for each dataset-algorithm pair, the median of the three quantile-normalized measures was computed and is shown in the heatmap using the color-coded scale. The heatmap rows are ranked by their median-per-row values, with the best performing algorithms shown at the top. The heatmap also shows that the datasets differ significantly in clustering quality: for example, most algorithms perform better on the Glioblastoma dataset and worse on the Melanoma dataset.
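The scoring steps in this caption can be sketched as follows. This is a hedged illustration, not the published code: the function names, the toy input layout, and the use of an empirical quantile (fraction of values less than or equal to a given value) are assumptions, and every measure is assumed to be oriented so that higher is better.

```python
from statistics import median


def to_quantiles(values):
    """Map each value to its empirical quantile: the fraction of values <= it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def rank_algorithms(scores):
    """scores maps (dataset, algorithm) to a tuple of measures, higher = better.

    Returns the algorithms ordered best first, plus the per-pair quality scores.
    """
    keys = list(scores)
    n_measures = len(next(iter(scores.values())))
    # Quantile-normalize each measure over all dataset-algorithm pairs.
    per_measure = [
        to_quantiles([scores[k][i] for k in keys]) for i in range(n_measures)
    ]
    # Quality score of a pair: median of its quantile-normalized measures.
    quality = {k: median(col[j] for col in per_measure) for j, k in enumerate(keys)}
    # Summary per algorithm: median quality across datasets (one heatmap row).
    by_algorithm = {}
    for (dataset, algorithm), q in quality.items():
        by_algorithm.setdefault(algorithm, []).append(q)
    summary = {a: median(qs) for a, qs in by_algorithm.items()}
    return sorted(summary, key=summary.get, reverse=True), quality
```

With toy values for two hypothetical algorithms "A" and "B" on two datasets, where "A" scores higher on every measure, `rank_algorithms` returns "A" first. A lower-is-better measure such as VI distance would need to be negated before being passed in.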
Fig. 4
Distribution of ranks for each of the 15 algorithms, based on eight different combinations of the three metrics used in the generation of the summary quality score, repeated in 10,000 randomized iterations. Each box in the boxplot thus represents 80,000 values of rank. The algorithms are sorted by median rank and fall into three categories (indicated with orange, blue, purple) based on their performance on the non-malignant cells. The top three algorithms are Seurat, bigSCale, and Cell Ranger. Fractional ranks represent ties. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
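Fractional ranks arise when tied algorithms share the mean of the positions they occupy. A minimal sketch of that convention is shown below; the function name and the input layout (a dictionary of summary quality scores, higher is better) are assumptions made for illustration, and the randomization procedure itself is not reproduced here.

```python
def fractional_ranks(scores):
    """Rank 1 is best (highest score); tied items share the mean of their positions."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        # Extend j to the last item tied with the item at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions i+1..j+1
        for name in order[i : j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks
```

For example, two algorithms tied for second and third place both receive rank 2.5.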
Fig. 5
Distribution of ranks for each of the 15 algorithms applied to the AML dataset only, based on eight different combinations of metrics used in the generation of the summary quality score, repeated in 10,000 randomized iterations. Each box in the boxplot thus represents 80,000 values of rank. Algorithms fall into three categories (indicated with orange, blue, purple) based on their performance on the non-malignant cells. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 6
A visual representation of the malignant component of the AML dataset is shown using tSNE, with individual malignant cells represented by colored dots. The colors represent either the clusters detected by the top-ranked algorithms (bigSCale, SC3, Cell Ranger; left-side panels) or the cell groups used as benchmarks, representing true malignant cell types, inferCNV groups or patient ID groups (right-side panels).
Fig. 7
The heatmap represents the F-measure of detecting each cell type (rows) in each dataset, either by clustering all cells or only non-malignant cells (columns). The left-most column represents the median values across all dataset versions.
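One way to score the detection of an individual cell type is sketched below. It assumes that a cell type is matched to whichever cluster gives it the highest F-measure (harmonic mean of precision and recall); this matching rule and the function name are assumptions for illustration and may differ from the procedure used in the paper.

```python
from collections import Counter


def cell_type_f_measure(truth, predicted, cell_type):
    """Best F-measure over all clusters for detecting one reference cell type."""
    type_size = sum(1 for t in truth if t == cell_type)
    if type_size == 0:
        return 0.0  # the cell type is absent from this dataset
    cluster_sizes = Counter(predicted)
    # Number of cells of this type that fall into each cluster (true positives).
    hits = Counter(p for t, p in zip(truth, predicted) if t == cell_type)
    # F = 2 * TP / (cluster size + type size), equivalent to 2PR / (P + R).
    return max(
        2 * hits[cluster] / (size + type_size)
        for cluster, size in cluster_sizes.items()
    )
```

Under this assumption, a rare cell type that is absorbed into a large mixed cluster receives a low score even when all of its cells end up in the same cluster, because precision is low.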
Fig. 8
The heatmaps show the running time in minutes for each algorithm (rows) on each dataset (columns), when clustering malignant cells (left) and non-tumor cells (right). The left-most column in each heatmap represents the median values across all dataset versions.
