Front Genet. 2023 May 23:14:1166404.
doi: 10.3389/fgene.2023.1166404. eCollection 2023.

Subject clustering by IF-PCA and several recent methods

Dieyi Chen et al. Front Genet.

Abstract

Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of significant interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received much attention. Two interesting questions are 1) how to combine the strengths of UDL and other approaches and 2) how these approaches compare to each other. We combine the variational auto-encoder (VAE), a popular UDL approach, with the recent idea of influential feature-principal component analysis (IF-PCA) and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on 10 gene microarray data sets and eight single-cell RNA-seq data sets. We find that IF-VAE shows significant improvement over VAE, but still underperforms compared to IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 over the eight single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving phase transition in a rare/weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).

Keywords: PCA; ScRNA-seq; feature selection; gene microarray; higher criticism threshold; sparsity; subject clustering; variational.
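
The abstract describes IF-PCA only at a high level: select influential features, then cluster subjects using principal components. The following is a minimal sketch of that two-step idea, not the authors' implementation. It assumes a Kolmogorov-Smirnov statistic as the feature score and keeps a fixed, user-chosen number of features, whereas the keyword list suggests the method picks its cutoff with a higher criticism threshold. The function names (`ks_scores`, `kmeans`, `if_pca_sketch`) and the synthetic data are hypothetical and used for illustration only.

```python
import math
import numpy as np


def standardize(X):
    """Center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def ks_scores(X):
    """Scaled Kolmogorov-Smirnov distance of each standardized column from N(0, 1).

    Assumption: KS is used here as the feature score; the abstract does not say so.
    """
    n = X.shape[0]
    Z = np.sort(standardize(X), axis=0)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(Z / math.sqrt(2.0)))
    upper = np.arange(1, n + 1)[:, None] / n  # empirical CDF at each sorted point
    lower = np.arange(0, n)[:, None] / n      # empirical CDF just before it
    d = np.maximum(np.abs(upper - cdf), np.abs(cdf - lower)).max(axis=0)
    return math.sqrt(n) * d


def kmeans(U, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels


def if_pca_sketch(X, k, n_features):
    """Select the top-scoring features, then cluster on k - 1 left singular vectors.

    n_features is a hypothetical fixed cutoff standing in for a data-driven threshold.
    """
    keep = np.argsort(ks_scores(X))[::-1][:n_features]
    U, _, _ = np.linalg.svd(standardize(X)[:, keep], full_matrices=False)
    return kmeans(U[:, : k - 1], k), keep


if __name__ == "__main__":
    # Synthetic example: 200 subjects, 500 features, only the first 20 carry signal.
    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 100)
    X = rng.standard_normal((200, 500))
    X[:, :20] += 3.0 * (2 * y - 1)[:, None]
    pred, keep = if_pca_sketch(X, k=2, n_features=20)
    acc = max(np.mean(pred == y), np.mean(pred != y))  # allow label switching
    print("accuracy:", acc, "true features kept:", int(np.sum(keep < 20)))
```

Fixing the number of features keeps the sketch short; a data-driven threshold would remove that tuning parameter, which is the part this sketch does not attempt.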

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Clustering errors of IF-VAE(X) as a function of the number of selected features in the IF step (data set: LungCancer (1); y-axis: number of clustering errors; x-axis: number of selected features).
FIGURE 2
Phase transition for PCA and IF-PCA (θ = 0.6). The (three-segment) solid green line is α = α*(β, θ), which separates the whole region into the Region of Impossibility (top) and the Region of Possibility (bottom). In the left part of the Region of Possibility (β < 1/2), feature selection is infeasible, PCA is optimal, and IF-PCA reduces to PCA with an appropriate threshold. In the right part (β > 1/2), feature selection is desirable and IF-PCA is optimal, whereas PCA is non-optimal for parameters in the shaded green region.
