SCClone: Accurate Clustering of Tumor Single-Cell DNA Sequencing Data

Zhenhua Yu¹,², Fang Du¹,², Lijuan Song¹,²

Affiliations

¹ School of Information Engineering, Ningxia University, Yinchuan, China.
² Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, Yinchuan, China.

PMID: 35154282
PMCID: PMC8830741
DOI: 10.3389/fgene.2022.823941


Zhenhua Yu et al. Front Genet. 2022 Jan 27;13:823941. doi: 10.3389/fgene.2022.823941. eCollection 2022.

Abstract

Single-cell DNA sequencing (scDNA-seq) enables high-resolution profiling of genetic diversity among single cells and is especially useful for deciphering intra-tumor heterogeneity and the evolutionary history of tumors. Technical issues such as allele dropout, false-positive errors, and doublets make scDNA-seq data incomplete and error-prone, which makes it difficult to infer the clonal architecture of a tumor accurately. To address these issues, we introduce SCClone, a new computational method for inferring subclones from single nucleotide variation (SNV) data of single cells. SCClone uses a probabilistic mixture model for binary data to cluster single cells into distinct subclones. To decipher the underlying clonal composition accurately, a novel model selection scheme based on inter-cluster variance is used to find the optimal number of subclones. Extensive evaluations on simulated datasets suggest that SCClone is robust to the different kinds of technical noise in scDNA-seq data and outperforms state-of-the-art methods in inferring clonal composition. Further evaluations on three real scDNA-seq datasets show that it can effectively find the underlying subclones in severely noisy data. The SCClone software is freely available at https://github.com/qasimyu/scclone.
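To make the idea of a mixture model for binary genotype data concrete, the sketch below computes the posterior probability that one cell belongs to each subclone. It is not the SCClone implementation: it assumes a simple error model in which a true mutation is missed with false-negative rate `beta`, an absent mutation is called with false-positive rate `alpha`, and unobserved entries contribute nothing to the likelihood. The names `MISSING`, `obs_log_prob`, and `cluster_posteriors`, and the toy genotypes, are hypothetical.

```python
import math

MISSING = None  # hypothetical marker for an unobserved entry


def obs_log_prob(d, g, alpha, beta):
    """Log P(observed state d | true genotype g) under an assumed FPR/FNR model."""
    if d is MISSING:
        return 0.0  # unobserved entries carry no information
    if g == 1:
        return math.log(1.0 - beta) if d == 1 else math.log(beta)
    return math.log(alpha) if d == 1 else math.log(1.0 - alpha)


def cluster_posteriors(cell, genotypes, weights, alpha, beta):
    """Posterior over subclones for one cell (the E-step of a mixture model)."""
    log_post = []
    for g_k, w_k in zip(genotypes, weights):
        ll = math.log(w_k)
        ll += sum(obs_log_prob(d, g, alpha, beta) for d, g in zip(cell, g_k))
        log_post.append(ll)
    # Normalise in log space to avoid underflow with many mutations.
    m = max(log_post)
    unnorm = [math.exp(x - m) for x in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy example: three assumed subclone genotypes over four mutations.
genotypes = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
weights = [1 / 3, 1 / 3, 1 / 3]
cell = [1, MISSING, 0, 0]  # one noisy, partially observed cell

post = cluster_posteriors(cell, genotypes, weights, alpha=0.01, beta=0.2)
best = max(range(len(post)), key=post.__getitem__)
print(best, [round(p, 3) for p in post])
```

In a full mixture model these posteriors would be recomputed in alternation with updates of the subclone genotypes, mixture weights, and error rates; only the assignment step is shown here.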

Keywords: cancer genome; clustering; intra-tumor heterogeneity; next-generation sequencing; single-cell sequencing.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
Schematic overview of the SCClone framework. SCClone takes a noisy genotype matrix D inferred from scDNA-seq data as input and infers the number of subclones as well as the genotypes of each subclone. Each element D_ij denotes the state (presence, absence, or unobserved) of the jth mutation in the ith cell, and the different states are marked in black, white, and gray, respectively.

**FIGURE 2**
Performance evaluation results on the simulated datasets D1 and D2. The dataset D1 consists of 500 × 200 genotype matrices with the false negative rate changing from 0.2 to 0.5, and dataset D2 is constituted by 500 × 200 genotype matrices with the missing rate ranging from 0.2 to 0.5. Four performance metrics including V-measure, accuracy, sensitivity, and specificity are measured to examine the effects of false-negative errors and missing entries on inference accuracy.
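The four metrics named in this caption can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in the paper: V-measure is computed from true and predicted cluster labels as the harmonic mean of homogeneity and completeness, and accuracy, sensitivity, and specificity are assumed to be computed entrywise between the true and recovered genotype matrices. All function names and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b) for two equal-length label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    hom = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, pred_labels) / h_true
    com = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred_labels, true_labels) / h_pred
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)


def genotype_metrics(true_g, pred_g):
    """Entrywise accuracy, sensitivity, and specificity of a recovered matrix."""
    tp = fp = tn = fn = 0
    for t_row, p_row in zip(true_g, pred_g):
        for t, p in zip(t_row, p_row):
            if t == 1:
                tp, fn = tp + (p == 1), fn + (p == 0)
            else:
                tn, fp = tn + (p == 0), fp + (p == 1)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy inputs: cluster labels are permuted but the partition is identical.
v = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
metrics = genotype_metrics([[1, 1, 0, 0], [0, 0, 1, 1]],
                           [[1, 0, 0, 0], [0, 0, 1, 1]])
print(v, metrics)
```

V-measure is invariant to how clusters are numbered, which is why the permuted labelling in the toy example still scores as a perfect match.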

**FIGURE 3**
Performance evaluation results on the simulated dataset D3. The dataset consists of 1,000 × 500 genotype matrices with the false negative rate being as high as 0.8.

**FIGURE 4**
Performance evaluation results on the simulated datasets D4 and D5. The dataset D4 consists of 1,000 × 500 genotype matrices with the number of subclones changing from 20 to 50, and dataset D5 is constituted by genotype matrices with the number of cells ranging from 500 to 2,000.

**FIGURE 5**
Number of subclones estimated by SCClone, BnpC, SCG, and RobustClone on the simulated dataset D4. The simulated number of subclones changes from 20 to 50. ΔK denotes the difference between predicted and expected number of subclones.

**FIGURE 6**
Performance evaluation results on the simulated dataset D6. The dataset consists of 200 × 50 genotype matrices with five subclones.

**FIGURE 7**
Error rate estimation results of SCClone on the simulated dataset D7. The simulated FPR α changes from 0.01 to 0.1, and FNR β changes from 0.05 to 0.4.
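One simple way to estimate the two error rates, once each cell has been assigned a recovered genotype, is to count disagreements between observed and recovered entries. The sketch below assumes hard cluster assignments and ignores unobserved entries; SCClone's own estimator may differ (for example, by using expected rather than hard counts). The function name `estimate_error_rates` and the toy matrices are hypothetical.

```python
def estimate_error_rates(observed, recovered):
    """Estimate (alpha, beta) from observed data and recovered genotypes.

    alpha: fraction of observed entries called 1 where the recovered genotype is 0.
    beta:  fraction of observed entries called 0 where the recovered genotype is 1.
    Entries equal to None are treated as unobserved and skipped.
    """
    fp = neg = fn = pos = 0
    for d_row, g_row in zip(observed, recovered):
        for d, g in zip(d_row, g_row):
            if d is None:
                continue
            if g == 1:
                pos += 1
                fn += d == 0
            else:
                neg += 1
                fp += d == 1
    alpha = fp / neg if neg else float("nan")
    beta = fn / pos if pos else float("nan")
    return alpha, beta


# Toy example: two cells from the same assumed subclone.
observed = [[1, 0, None, 0],
            [1, 1, 0, 1]]
recovered = [[1, 1, 0, 0],
             [1, 1, 0, 0]]

alpha_hat, beta_hat = estimate_error_rates(observed, recovered)
print(alpha_hat, beta_hat)
```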

**FIGURE 8**
Subclones and their lineage relationship inferred from the metastatic colorectal cancer dataset. SCClone identifies five subclones. Subclone1 represents the normal population without mutations, subclone2 consists of mutated diploid cells, subclone3 and subclone4 consist of primary aneuploid cells, and subclone5 represents metastatic aneuploid cells. The estimated FPR α and FNR β are 0.96% and 14.46%, respectively.

**FIGURE 9**
Subclones inferred from high grade serous ovarian cancer dataset. SCClone identifies five subclones and estimates the error rates as α = 2.26% and β = 31.61%. **(A)** The observed 420 × 43 genotype matrix. **(B)** The recovered genotype matrix and inferred subclones. **(C)** Distribution of the cells among subclones. **(D)** Constructed lineage relationship of subclones by building the minimum spanning tree.
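The lineage construction in panel (D) can be illustrated with a minimum spanning tree over subclone genotypes. The sketch below uses Prim's algorithm with Hamming distance between binary genotypes; the distance measure and the choice of the mutation-free subclone as the starting node are assumptions made for illustration, not details confirmed by the caption. The toy genotypes and function names are hypothetical.

```python
def hamming(a, b):
    """Number of mutations at which two binary genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(genotypes, root=0):
    """Prim's algorithm; returns edges as (node_in_tree, new_node, distance)."""
    n = len(genotypes)
    in_tree = {root}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree.
        d, u, v = min(
            (hamming(genotypes[u], genotypes[v]), u, v)
            for u in in_tree
            for v in range(n)
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges


# Toy subclone genotypes over five mutations; index 0 is mutation-free.
subclones = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
]

tree = minimum_spanning_tree(subclones)
print(tree)  # [(0, 1, 2), (1, 2, 1), (1, 3, 2)]
```

A minimum spanning tree is undirected; reading it as a lineage requires choosing a root, which is why the mutation-free subclone is used as the starting node here.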

**FIGURE 10**
Subclones inferred from IDH-mutant gliomas dataset. SCClone identifies 18 subclones and estimates the error rates as α = 0.64% and β = 82.38%. **(A)** The observed 926 × 1,392 genotype matrix. **(B)** The recovered genotype matrix and inferred subclones by SCClone. **(C)** The recovered genotype matrix and inferred subclones by BnpC.

References

1. Borgsmüller N., Bonet J., Marass F., Gonzalez-Perez A., Lopez-Bigas N., Beerenwinkel N. (2020). BnpC: Bayesian Non-parametric Clustering of Single-Cell Mutation Profiles. Bioinformatics 36, 4854–4859. doi: 10.1093/bioinformatics/btaa599
2. Chen Z., Gong F., Wan L., Ma L. (2020). RobustClone: a Robust PCA Method for Tumor Clone and Evolution Inference from Single-Cell Sequencing Data. Bioinformatics 36, 3299–3306. doi: 10.1093/bioinformatics/btaa172
3. Ciccolella S., Patterson M., Bonizzoni P., Della Vedova G. (2021a). Effective Clustering for Single Cell Sequencing Cancer Data. IEEE J. Biomed. Health Inform. 25, 4068–4078. doi: 10.1109/jbhi.2021.3081380
4. Ciccolella S., Ricketts C., Soto Gomez M., Patterson M., Silverbush D., Bonizzoni P., et al. (2021b). Inferring Cancer Progression from Single-Cell Sequencing while Allowing Mutation Losses. Bioinformatics 37, 326–333. doi: 10.1093/bioinformatics/btaa722
5. Ding L., Ley T. J., Larson D. E., Miller C. A., Koboldt D. C., Welch J. S., et al. (2012). Clonal Evolution in Relapsed Acute Myeloid Leukaemia Revealed by Whole-Genome Sequencing. Nature 481, 506–510. doi: 10.1038/nature10738
