Comput Struct Biotechnol J. 2022 Oct 26:20:6375-6387.

doi: 10.1016/j.csbj.2022.10.029. eCollection 2022.

Evaluation of single-cell RNA-seq clustering algorithms on cancer tumor datasets

Alaina Mahalanabis¹, Andrei L Turinsky¹, Mia Husić¹, Erik Christensen²,³, Ping Luo⁴, Alaine Naidas³,⁵, Michael Brudno¹,⁶,⁷, Trevor Pugh⁴,⁸,⁹, Arun K Ramani¹, Parisa Shooshtari²,³,⁵,⁸

Affiliations

¹ Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada.
² Department of Computer Science, University of Western Ontario, London, ON, Canada.
³ Children's Health Research Institute, Lawson Health Research Institute, London, ON, Canada.
⁴ Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada.
⁵ Department of Pathology and Laboratory Medicine, University of Western Ontario, London, ON, Canada.
⁶ Techna Institute, University Health Network, Toronto, Canada.
⁷ Department of Computer Science, University of Toronto, Toronto, Canada.
⁸ Ontario Institute for Cancer Research, Toronto, ON, Canada.
⁹ Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada.

PMID: 36420149
PMCID: PMC9677128
DOI: 10.1016/j.csbj.2022.10.029

Alaina Mahalanabis et al. Comput Struct Biotechnol J. 2022.

Abstract

Tumors are complex biological entities that comprise cell types of different origins, with different mutational profiles and different patterns of transcriptional dysregulation. The exploration of data related to cancer biology requires careful analytical methods to reflect the heterogeneity of cell populations in cancer samples. Single-cell techniques are now able to capture the transcriptional profiles of individual cells. However, the complexity of RNA-seq data, especially in cancer samples, makes it challenging to cluster single-cell profiles into groups that reflect the underlying cell types. We have developed a framework for a systematic examination of single-cell RNA-seq clustering algorithms for cancer data, which uses a range of well-established metrics to generate a unified quality score and algorithm ranking. To demonstrate this framework, we examined clustering performance of 15 different single-cell RNA-seq clustering algorithms on eight different cancer datasets. Our results suggest that the single-cell RNA-seq clustering algorithms fall into distinct groups by performance, with the highest clustering quality on non-malignant cells achieved by three algorithms: Seurat, bigSCale and Cell Ranger. However, for malignant cells, two additional algorithms often reach a better performance, namely Monocle and SC3. Their ability to detect known rare cell types was also among the best, along with Seurat. Our approach and results can be used by a broad audience of practitioners who analyze single-cell transcriptomic data in cancer research.

Keywords: Automated algorithms; Cancer; Clustering; Framework; Single-Cell RNA-seq.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1**
The main analysis workflow consisted of four stages. First, the clustering algorithms were applied to the eight cancer datasets to generate clustering partitions (blue). Then seven different metrics of clustering quality were examined and grouped into three distinct groups by similarity (yellow). By combining three representative measures, one per group, we generated quality scores first for each clustering partition and then for each algorithm (green). Finally, we ranked the algorithms by quality scores for each choice of measures, and then combined these ranks into a final ranking (pink). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

**Fig. 2**
Clustering quality was assessed using seven different measures for each pair of algorithm and dataset: AMI, ARI, F-measure, homogeneity, majority, silhouette and VI distance. Principal component (PC) analysis with feature scaling was then performed on the collection of 102 clustering partitions in the space defined by the quality measures. The heatmap shows the absolute Pearson correlation among the seven measures, as well as the top three PCs. The latter collectively explain over 90% of the variance in the measurement data, as indicated in their labels. Four measures (AMI, ARI, F-measure and VI) are best correlated with PC1, which captures 61% of the variation. Two other measures, homogeneity and majority, are also highly correlated and are best reflected by PC2. The remaining measure, silhouette, is represented by PC3.
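
As an illustration of what three of the listed measures quantify, the sketch below computes ARI, homogeneity and VI distance from their standard textbook definitions, using only the Python standard library. It is not the authors' code: the function names are hypothetical, natural logarithms are assumed, and at least two cells are assumed to be present.

```python
from collections import Counter
from math import comb, log


def _tables(truth, pred):
    """Count cells per (cell type, cluster) pair, plus both marginals."""
    joint = Counter(zip(truth, pred))
    return joint, Counter(truth), Counter(pred), len(truth)


def _entropy(counts, n):
    return -sum(c / n * log(c / n) for c in counts.values())


def _mutual_information(joint, types, clusters, n):
    return sum(c / n * log(c * n / (types[t] * clusters[k]))
               for (t, k), c in joint.items())


def homogeneity(truth, pred):
    """1 - H(type | cluster) / H(type); 1.0 when every cluster holds one type."""
    joint, types, clusters, n = _tables(truth, pred)
    h_types = _entropy(types, n)
    if h_types == 0:
        return 1.0
    return _mutual_information(joint, types, clusters, n) / h_types


def vi_distance(truth, pred):
    """Variation of information: H(type) + H(cluster) - 2 * MI (lower is better)."""
    joint, types, clusters, n = _tables(truth, pred)
    return (_entropy(types, n) + _entropy(clusters, n)
            - 2 * _mutual_information(joint, types, clusters, n))


def adjusted_rand_index(truth, pred):
    """Pair-counting agreement, corrected for chance."""
    joint, types, clusters, n = _tables(truth, pred)
    together = sum(comb(c, 2) for c in joint.values())
    same_type = sum(comb(c, 2) for c in types.values())
    same_cluster = sum(comb(c, 2) for c in clusters.values())
    expected = same_type * same_cluster / comb(n, 2)
    maximum = (same_type + same_cluster) / 2
    if maximum == expected:
        return 1.0
    return (together - expected) / (maximum - expected)


# Toy labels (hypothetical): four cells, two known cell types.
truth = ["T cell", "T cell", "B cell", "B cell"]
print(adjusted_rand_index(truth, [0, 0, 1, 1]))  # clusters match the cell types
print(homogeneity(truth, [0, 0, 0, 0]))          # one cluster mixes both types
```

Note that VI is a distance, so lower values indicate better agreement, whereas higher values are better for ARI and homogeneity.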

**Fig. 3**
Example of the ranking of the 15 clustering algorithms based on eight different combinations of the three metrics used in the generation of the summary quality score. For each dataset-algorithm pair, the three representative measures (e.g. AMI, homogeneity and silhouette) were converted into quantile values based on the three respective data distributions. Then, for each dataset-algorithm pair, a median of the three quantile-normalized measures was generated, and is shown in the heatmap using the color-coded scale. The heatmap rows are then ranked by their median-per-row values, with the best performing algorithms shown at the top of the heatmap. The heatmap also shows that the datasets differ significantly in terms of clustering quality: for example, most algorithms perform better on the Glioblastoma dataset and worse on the Melanoma dataset.
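
The scoring steps described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy scores and the names `to_quantiles` and `rank_algorithms` are hypothetical, and a "quantile value" is assumed here to be the fraction of values in the same measure's distribution that are less than or equal to the value in question. All three measures are assumed to be "higher is better".

```python
from statistics import median

# Hypothetical toy input: (algorithm, dataset) -> (AMI, homogeneity, silhouette).
scores = {
    ("Algo A", "d1"): (0.9, 0.8, 0.5),
    ("Algo A", "d2"): (0.7, 0.9, 0.4),
    ("Algo B", "d1"): (0.3, 0.4, 0.1),
    ("Algo B", "d2"): (0.5, 0.2, 0.3),
}


def to_quantiles(values):
    """Map each value to the fraction of the distribution that is <= it."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


def rank_algorithms(scores):
    pairs = list(scores)
    # One column per measure, holding its values across all pairs.
    columns = zip(*(scores[p] for p in pairs))
    quantile_columns = [to_quantiles(list(col)) for col in columns]
    # Median of the three quantile-normalized measures for each pair
    # (one heatmap cell per pair).
    pair_score = {p: median(q[i] for q in quantile_columns)
                  for i, p in enumerate(pairs)}
    # Median per heatmap row, i.e. per algorithm across datasets.
    per_algorithm = {}
    for (algorithm, _dataset), value in pair_score.items():
        per_algorithm.setdefault(algorithm, []).append(value)
    summary = {a: median(v) for a, v in per_algorithm.items()}
    ranking = sorted(summary, key=summary.get, reverse=True)
    return ranking, summary


ranking, summary = rank_algorithms(scores)
print(ranking, summary)
```

Working on quantiles rather than raw values puts measures with different ranges on a common 0 to 1 scale before the median is taken.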

**Fig. 4**
Distribution of ranks for each of the 15 algorithms, based on eight different combinations of the three metrics used in the generation of the summary quality score, repeated in 10,000 randomized iterations. Each box in the boxplot thus represents 80,000 values of rank. The algorithms are sorted by the median rank. They fell into three categories (indicated with orange, blue, purple) based on their performance on the non-malignant cells. The top three algorithms are Seurat, bigSCale, and Cell Ranger. Fractional ranks represent ties. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
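
The count of eight combinations follows from picking one representative measure from each of the three groups described in Fig. 2 (4 × 2 × 1 = 8), and 8 combinations × 10,000 iterations gives the 80,000 rank values per box. The sketch below enumerates those combinations and shows one common way to assign fractional ranks to ties (tied entries share the mean of the positions they occupy). The function name and toy scores are hypothetical, and the sketch does not attempt to reproduce the randomized iterations, whose procedure the caption does not describe.

```python
from itertools import product

# Measure groups as described in the Fig. 2 caption.
GROUPS = [
    ("AMI", "ARI", "F-measure", "VI"),
    ("homogeneity", "majority"),
    ("silhouette",),
]

# One representative measure per group: 4 * 2 * 1 = 8 combinations.
combinations = list(product(*GROUPS))


def fractional_ranks(scores):
    """Rank 1 is the highest score; tied entries share the mean of their positions."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every entry tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


print(len(combinations))
print(fractional_ranks({"X": 0.9, "Y": 0.7, "Z": 0.7}))
```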

**Fig. 5**
Distribution of ranks for each of the 15 algorithms applied to the AML dataset only, based on eight different combinations of metrics used in the generation of the summary quality score, repeated in 10,000 randomized iterations. Each box in the boxplot thus represents 80,000 values of rank. Algorithms fall into three categories (indicated with orange, blue, purple) based on their performance on the non-malignant cells. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

**Fig. 6**
A visual representation of the malignant component of the AML dataset is shown using tSNE, with individual malignant cells represented with colored dots. The colors represent either the clusters detected by the top-ranked algorithms (bigSCale, SC3, Cell Ranger; left side panels) or the cell groups used as benchmarks, representing true malignant cell types, inferCNV groups or patient ID groups (right side panels).

**Fig. 7**
The heatmap represents the F-measure of detecting each cell type (rows) in each dataset, either by clustering all cells or only non-malignant cells (columns). The left-most column represents the median values across all dataset versions.
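
The caption does not say how the per-cell-type F-measure is computed. One common convention, assumed here purely for illustration, is to take for each cell type the best F1 score (harmonic mean of precision and recall) over all clusters. The function name and toy labels below are hypothetical.

```python
def best_f_measure(truth, pred, cell_type):
    """Best F1 over all clusters for one cell type (assumed convention)."""
    in_type = {i for i, t in enumerate(truth) if t == cell_type}
    best = 0.0
    for cluster in set(pred):
        in_cluster = {i for i, k in enumerate(pred) if k == cluster}
        overlap = len(in_type & in_cluster)
        if overlap == 0:
            continue
        precision = overlap / len(in_cluster)  # share of the cluster that is this type
        recall = overlap / len(in_type)        # share of this type captured by the cluster
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy labels (hypothetical): a small "rare" population merged with one other cell.
truth = ["T", "T", "T", "rare", "rare"]
pred = [0, 0, 1, 1, 1]
print(best_f_measure(truth, pred, "rare"))
```

Under this convention, a rare cell type scores well only if some cluster both captures most of its cells and contains few cells of other types.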

**Fig. 8**
The heatmap represents the timing in minutes for each algorithm (rows) in each dataset (columns), by clustering malignant cells (left) and non-tumor cells (right). The left-most column in each heatmap represents the median values across all dataset versions.

Cited by

Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference.
Dong X, Leary JR, Yang C, Brusko MA, Brusko TM, Bacher R. bioRxiv [Preprint]. 2023 Dec 19:2023.12.18.572214. doi: 10.1101/2023.12.18.572214. Update in: Brief Bioinform. 2024 Mar 27;25(3):bbae216. doi: 10.1093/bib/bbae216. PMID: 38187768.
Serial single-cell RNA sequencing unveils drug resistance and metastatic traits in stage IV breast cancer.
Otsuji K, Takahashi Y, Osako T, Kobayashi T, Takano T, Saeki S, Yang L, Baba S, Kumegawa K, Suzuki H, Noda T, Takeuchi K, Ohno S, Ueno T, Maruyama R. NPJ Precis Oncol. 2024 Oct 3;8(1):222. doi: 10.1038/s41698-024-00723-6. PMID: 39363009.
A method for in silico exploration of potential glioblastoma multiforme attractors using single-cell RNA sequencing.
Vieira Junior MG, de Almeida Côrtes AM, Gonçalves Carneiro FR, Carels N, Silva FABD. Sci Rep. 2024 Oct 29;14(1):26003. doi: 10.1038/s41598-024-74985-2. PMID: 39472601.
