Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Nov;25(11):4068-4078.
doi: 10.1109/JBHI.2021.3081380. Epub 2021 Nov 5.

Effective Clustering for Single Cell Sequencing Cancer Data

Effective Clustering for Single Cell Sequencing Cancer Data

Simone Ciccolella et al. IEEE J Biomed Health Inform. 2021 Nov.

Abstract

Single cell sequencing (SCS) technologies provide a level of resolution that makes it indispensable for inferring from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes it difficult, sometimes infeasible using current approaches and tools. One possible solution is to reduce the size of an SCS instance - usually represented as a matrix of presence, absence, and uncertainty of the mutations found in the different sequenced cells - and to infer the tree from this reduced-size instance. In this work, we present a new clustering procedure aimed at clustering such categorical vector, or matrix data - here representing SCS instances, called celluloid. We show that celluloid clusters mutations with high precision: never pairing too many mutations that are unrelated in the ground truth, but also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. We demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime, raising considerably the upper bound on the size of SCS instances which can be solved in practice. Our approach, celluloid: clustering single cell sequencing data around centroids is available at https://github.com/AlgoLab/celluloid/ under an MIT license, as well as on the Python Package Index (PyPI) at https://pypi.org/project/celluloid-clust/.

PubMed Disclaimer

Publication types

LinkOut - more resources