Impact of similarity metrics on single-cell RNA-seq data clustering
- PMID: 30137247
- DOI: 10.1093/bib/bby076
Abstract
Advances in high-throughput sequencing on single-cell gene expressions [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling on individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, currently there is no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics on clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson's correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test if the use of correlation-based metrics can benefit the recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) using Pearson's correlation as a similarity measure and found significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson's correlation as a favourable choice. Further comparison on different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. 
Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.
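The contrast between correlation-based and distance-based metrics can be illustrated with a small sketch. This is a toy example under stated assumptions, not the benchmarking framework from the paper: the six-gene profiles, the per-cell depth factors, and the helper names (`pearson_distance`, `euclidean_distance`, `stratified_subsample`, `nearest_neighbour_agreement`) are all hypothetical. Nearest-neighbour label agreement stands in for the evaluation measures the abstract mentions, which are not specified here. The toy cells share a profile within each type but differ in overall magnitude, a situation in which a scale-sensitive metric such as Euclidean distance can be expected to struggle.

```python
import math
import random


def pearson_distance(x, y):
    """Correlation-based dissimilarity: 1 - Pearson's r (assumes non-constant vectors)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def euclidean_distance(x, y):
    """Distance-based dissimilarity: straight-line distance between two cells."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def stratified_subsample(labels, fraction, rng):
    """Draw the same fraction of cell indices from every annotated cell type."""
    by_type = {}
    for index, label in enumerate(labels):
        by_type.setdefault(label, []).append(index)
    chosen = []
    for label in sorted(by_type):
        indices = by_type[label]
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def nearest_neighbour_agreement(cells, labels, dist):
    """Fraction of cells whose closest other cell carries the same label."""
    hits = 0
    for i, cell in enumerate(cells):
        others = [j for j in range(len(cells)) if j != i]
        nearest = min(others, key=lambda j: dist(cell, cells[j]))
        hits += labels[i] == labels[nearest]
    return hits / len(cells)


# Hypothetical toy data: two cell types with opposite expression profiles,
# each cell scaled by its own depth factor (assumed values, no noise).
PROFILE = {"A": [5, 4, 3, 1, 1, 1], "B": [1, 1, 1, 3, 4, 5]}
DEPTHS = {"A": [1, 4, 16], "B": [2, 8, 32]}

cells, labels = [], []
for cell_type in ("A", "B"):
    for depth in DEPTHS[cell_type]:
        cells.append([depth * g for g in PROFILE[cell_type]])
        labels.append(cell_type)

pearson_score = nearest_neighbour_agreement(cells, labels, pearson_distance)
euclidean_score = nearest_neighbour_agreement(cells, labels, euclidean_distance)
print("Pearson:", pearson_score, "Euclidean:", round(euclidean_score, 3))

# Stratified subsampling on a separate, unbalanced label vector (also hypothetical).
demo_labels = ["A"] * 10 + ["B"] * 4
subset = stratified_subsample(demo_labels, 0.5, random.Random(0))
print("Subsampled indices:", subset)
```

In this constructed case Pearson's correlation ignores the per-cell scaling and keeps every cell next to its own type, whereas Euclidean distance pairs the two shallowest cells across types. The outcome follows from how the toy data were built and says nothing about real datasets; the subsampling helper simply preserves the type proportions of the annotated cells.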
Keywords: clustering; correlation; distance; scRNA-seq; similarity metric; single-cell RNA-seq.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.