Front Genet. 2023 May 23:14:1166404.

doi: 10.3389/fgene.2023.1166404. eCollection 2023.

Subject clustering by IF-PCA and several recent methods

Dieyi Chen¹, Jiashun Jin², Zheng Tracy Ke¹

Affiliations

¹ Department of Statistics, Harvard University, Cambridge, MA, United States.
² Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, United States.

PMID: 37287536
PMCID: PMC10242062
DOI: 10.3389/fgene.2023.1166404

Dieyi Chen et al. Front Genet. 2023.

Abstract

Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of significant interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received much attention. Two interesting questions are 1) how to combine the strengths of UDL and other approaches and 2) how these approaches compare to each other. We combine the variational auto-encoder (VAE), a popular UDL approach, with the recent idea of influential feature-principal component analysis (IF-PCA) and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on 10 gene microarray data sets and eight single-cell RNA-seq data sets. We find that IF-VAE shows significant improvement over VAE, but still underperforms compared to IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 over the eight single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving phase transition in a rare/weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).

Keywords: PCA; scRNA-seq; feature selection; gene microarray; higher criticism threshold; sparsity; subject clustering; variational.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
Clustering errors of IF-VAE(X) as a function of the number of selected features in the IF step (data set: LungCancer (1); y-axis: number of clustering errors; x-axis: number of selected features).

**FIGURE 2**
Phase transition for PCA and IF-PCA (θ = 0.6). The (three-segment) solid green line is α = α*(β, θ), which separates the whole region into the Region of Impossibility (top) and the Region of Possibility (bottom). In the left part of the Region of Possibility (β < 1/2), feature selection is infeasible, PCA is optimal, and IF-PCA reduces to PCA with an appropriate threshold. In the right part (β > 1/2), it is desirable to conduct feature selection, and IF-PCA is optimal. However, PCA is non-optimal for parameters in the shaded green region.
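
The idea behind IF-PCA named in the abstract (select influential features first, then run PCA and cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several parts are simplifying assumptions: features are ranked by a plain Kolmogorov-Smirnov distance from the standard normal, the number of kept features `n_keep` is fixed by hand rather than chosen by a higher criticism threshold, the two clusters are read off the sign of the leading singular vector instead of running k-means, and the data are simulated. The function names `ks_scores` and `if_pca_sketch` are hypothetical.

```python
import math
import numpy as np


def ks_scores(X):
    """Kolmogorov-Smirnov distance of each standardized column from N(0, 1).

    Assumes no column is constant (the standard deviation must be nonzero).
    """
    n = X.shape[0]
    Z = np.sort((X - X.mean(axis=0)) / X.std(axis=0), axis=0)
    # Standard normal CDF evaluated at every sorted, standardized entry.
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(Z / math.sqrt(2.0)))
    upper = np.arange(1, n + 1)[:, None] / n  # empirical CDF just after each point
    lower = np.arange(0, n)[:, None] / n      # empirical CDF just before each point
    return np.maximum(upper - cdf, cdf - lower).max(axis=0)


def if_pca_sketch(X, n_keep):
    """Two-cluster sketch: keep the n_keep highest-scoring features, then PCA."""
    scores = ks_scores(X)
    keep = np.argsort(scores)[-n_keep:]       # assumption: fixed-size selection
    W = X[:, keep]
    W = (W - W.mean(axis=0)) / W.std(axis=0)
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    labels = (U[:, 0] > 0).astype(int)        # assumption: sign split, not k-means
    return labels, keep


# Simulated data (an assumption for illustration): 200 subjects, 500 features,
# of which only the first 20 differ between the two classes.
rng = np.random.default_rng(0)
n, p, n_useful, mu = 200, 500, 20, 2.5
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[:, :n_useful] += mu * (2 * y - 1)[:, None]

labels, keep = if_pca_sketch(X, n_keep=n_useful)
# Cluster labels are only defined up to a swap, so take the better matching.
err = min(np.mean(labels != y), np.mean(labels == y))
print("clustering error rate:", err)
print("useful features recovered:", int(np.sum(keep < n_useful)), "of", n_useful)
```

Rerunning `if_pca_sketch` over a range of `n_keep` values and recording the error each time gives a curve of the same kind as the one described in the Figure 1 caption, here for the simulated data rather than for LungCancer (1).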
