Impact of similarity metrics on single-cell RNA-seq data clustering

Taiyun Kim¹, Irene Rui Chen¹, Yingxin Lin¹, Andy Yi-Yang Wang², Jean Yee Hwa Yang¹, Pengyi Yang¹

Affiliations

¹ School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia.
² Department of Anaesthesia, The University of Sydney Northern Clinical School, The University of Sydney, Sydney, NSW 2006, Australia.

PMID: 30137247
DOI: 10.1093/bib/bby076

Taiyun Kim et al. Brief Bioinform. 2019 Nov 27;20(6):2316-2326.

Abstract

Advances in high-throughput sequencing of single-cell gene expression [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling of individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, currently there is no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics for clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson's correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test whether the use of correlation-based metrics can benefit recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) to use Pearson's correlation as its similarity measure and found a significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson's correlation as a favourable choice. Further comparison of different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.
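
The difference between the two metric families named in the abstract can be illustrated with a minimal sketch. This is not the authors' benchmarking framework: the function names and the three toy expression vectors are hypothetical, chosen only to show one mathematical property, namely that Pearson's correlation is unchanged when a profile is rescaled by a positive factor, whereas Euclidean distance is not.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pearson_dissimilarity(x, y):
    """Turn the correlation into a dissimilarity in [0, 2]."""
    return 1.0 - pearson_correlation(x, y)


def nearest(query, candidates, dissimilarity):
    """Name of the candidate vector with the smallest dissimilarity to query."""
    return min(candidates, key=lambda name: dissimilarity(query, candidates[name]))


# Hypothetical toy data: four genes measured in three cells.
cell_a = [1.0, 5.0, 2.0, 8.0]
candidates = {
    "cell_b": [3.0, 15.0, 6.0, 24.0],  # same profile as cell_a, scaled by 3
    "cell_c": [8.0, 2.0, 5.0, 1.0],    # reversed profile, similar magnitude
}

by_euclidean = nearest(cell_a, candidates, euclidean_distance)
by_pearson = nearest(cell_a, candidates, pearson_dissimilarity)
print("Nearest by Euclidean distance:", by_euclidean)
print("Nearest by Pearson dissimilarity:", by_pearson)
```

In this toy case the two metrics pick different nearest neighbours for `cell_a`, which is the kind of disagreement that changes cluster assignments. Whether this property explains the benchmark results reported above is not stated in the abstract.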

Keywords: clustering; correlation; distance; scRNA-seq; similarity metric; single-cell RNA-seq.

Cited by

Evaluation of tea (Camellia sinensis L.) phytochemicals as multi-disease modulators, a multidimensional in silico strategy with the combinations of network pharmacology, pharmacophore analysis, statistics and molecular docking.
Nag A, Dhull N, Gupta A. Mol Divers. 2023 Feb;27(1):487-509. doi: 10.1007/s11030-022-10437-1. Epub 2022 May 10. PMID: 35536529.
Multi-omics Analysis of Microenvironment Characteristics and Immune Escape Mechanisms of Hepatocellular Carcinoma.
Li W, Wang H, Ma Z, Zhang J, Ou-Yang W, Qi Y, Liu J. Front Oncol. 2019 Oct 15;9:1019. doi: 10.3389/fonc.2019.01019. eCollection 2019. PMID: 31681571.
q-Diffusion leverages the full dimensionality of gene coexpression in single-cell transcriptomics.
Marmarelis MG, Littman R, Battaglin F, Niedzwiecki D, Venook A, Ambite JL, Galstyan A, Lenz HJ, Ver Steeg G. Commun Biol. 2024 Apr 2;7(1):400. doi: 10.1038/s42003-024-06104-w. PMID: 38565955.
Transfer learning for clustering single-cell RNA-seq data crossing-species and batch, case on uterine fibroids.
Wang YM, Sun Y, Wang B, Wu Z, He XY, Zhao Y. Brief Bioinform. 2023 Nov 22;25(1):bbad426. doi: 10.1093/bib/bbad426. PMID: 37991248.
Accurate feature selection improves single-cell RNA-seq cell clustering.
Su K, Yu T, Wu H. Brief Bioinform. 2021 Sep 2;22(5):bbab034. doi: 10.1093/bib/bbab034. PMID: 33611426.
