PLoS One. 2022 Sep 29;17(9):e0275472.

doi: 10.1371/journal.pone.0275472. eCollection 2022.

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Y-H Taguchi¹, Turki Turki²

Affiliations

¹ Department of Physics, Chuo University, Bunkyo-ku, Tokyo, Japan.
² Department of Computer Science, King Abdulaziz University, Jeddah, Saudi Arabia.

PMID: 36173994
PMCID: PMC9521941
DOI: 10.1371/journal.pone.0275472

Y-H Taguchi et al. PLoS One. 2022.

Abstract

Identifying differentially expressed genes is difficult because the number of available samples is small compared with the number of genes. Conventional gene selection methods that rely on statistical tests have a critical problem: their P-values depend heavily on sample size. Although the recently proposed principal component analysis (PCA)- and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why it works so well has been unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP), which was proposed long ago to address the problem of dimensionality; we relate the space spanned by singular value vectors to that spanned by the optimal cluster centroids obtained from K-means. The success of PCA- and TD-based unsupervised FE can thus be understood through this equivalence. In addition, a threshold of 0.01 on adjusted P-values computed under the null hypothesis that the singular value vectors attributed to genes obey a Gaussian distribution empirically corresponds to a threshold of 0.1 on adjusted P-values when the null distribution is generated by shuffling the gene order. To show this, we newly applied PP to three data sets to which PCA- and TD-based unsupervised FE had previously been applied; these data sets cover two topics: biomarker identification for kidney cancers (the first two) and drug discovery for COVID-19 (the third). We found good agreement between PP and PCA- or TD-based unsupervised FE. The shuffling procedure described above was also applied successfully to these three data sets. These findings thus rationalize the success of PCA- and TD-based unsupervised FE for the first time.
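The relation between the singular-vector subspace and the K-means centroid subspace mentioned in the abstract can be illustrated numerically. The following is a minimal sketch on synthetic data, not the authors' analysis: the data sizes, the simple `kmeans` helper, and all variable names are assumptions made for illustration. It clusters centred data with K = 3 and measures the principal angles between the span of the centroids and the span of the top K - 1 right singular vectors.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialisation (illustrative helper)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen centre
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Synthetic data (assumption): 3 well-separated clusters in 50 dimensions.
rng = np.random.default_rng(0)
k, n_per, p = 3, 100, 50
true_centers = 5.0 * rng.normal(size=(k, p))
X = np.vstack([c + rng.normal(size=(n_per, p)) for c in true_centers])
Xc = X - X.mean(axis=0)  # centre the data

centers, labels = kmeans(Xc, k)

# Orthonormal basis of the centroid span; for centred data it has rank k - 1.
_, _, Vc = np.linalg.svd(centers, full_matrices=False)
centroid_basis = Vc[: k - 1]

# Top k - 1 right singular vectors (principal directions) of the data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_basis = Vt[: k - 1]

# Cosines of the principal angles between the two subspaces (1 means identical).
cosines = np.linalg.svd(centroid_basis @ pc_basis.T, compute_uv=False)
print(cosines)
```

Cosines close to 1 indicate that the two subspaces nearly coincide on this toy example; how closely they agree on real expression data depends on how well separated the clusters are.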

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Histogram of raw P-values computed using the null distribution generated by shuffling when miRNAs in the first data set were considered.**
(A) All miRNAs (B) Top 500 most expressive miRNAs.

**Fig 2. Histogram of raw P-values computed using the null distribution generated by shuffling when the mRNAs in the first data set were considered.**
(A) All mRNAs (B) Top 3000 most expressive mRNAs.

**Fig 3. Histogram of raw P-values computed using the null distribution generated by shuffling when genes in the third data set were considered.**
(A) All genes (B) Top 2780 most expressive genes.

**Fig 4. QQplot between P-values computed by TD-based unsupervised FE and projection.**
(A) mRNA in the first data set (B) miRNA in the first data set (C) mRNA in the second data set (D) miRNA in the second data set.

**Fig 5. QQplot of P-values between TD-based unsupervised FE and PP (the third data set).**

**Fig 6. Histogram of raw P-values computed using the null distribution generated by shuffling when the second data set was considered.**
(A) All miRNAs (B) All mRNAs.

**Fig 7. Discussion of the workflow used in this study.**
Tensor decomposition (HOSVD) was applied to the tensors, and P-values were attributed to genes using the obtained singular value vectors, which were assumed to obey a Gaussian distribution. Genes associated with adjusted P-values less than 0.01 were selected. P-values were also computed by shuffling, and the genes associated with adjusted P-values less than 0.1 coincided well with the genes selected by HOSVD. The correspondence between singular value vectors and K-means applied to unfolded matrices is also discussed.

**Fig 8. Comparisons between $y_{jkm}$ and either $v_{5(jkm)}$ or $u_{1j}u_{2k}u_{1m}$.**
Red straight lines indicate linear regressions.
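
The workflow summarized in the Fig 7 caption can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the authors' code: the tensor sizes, the planted signal, the per-column shuffling scheme, the number of shuffles, and the helper names `bh_adjust`, `gaussian_pvalues` and `shuffle_pvalues` are all hypothetical. Only the two thresholds (0.01 for the Gaussian null, 0.1 for the shuffling null) come from the text. The gene-mode singular vectors of HOSVD are obtained here as the left singular vectors of the mode-1 unfolding.

```python
import math
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    adj = np.empty(n)
    adj[order] = np.minimum(ranked, 1.0)
    return adj

def gaussian_pvalues(u):
    """P-values assuming the components of u are Gaussian with sd = std(u).
    erfc(|z|/sqrt(2)) equals the upper tail of chi-squared(1) at z**2."""
    z = np.abs(u) / np.std(u)
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])

def shuffle_pvalues(X, v, n_shuffles=100, seed=0):
    """Empirical P-values of |X @ v| against a null built by shuffling the
    gene order independently in every column (assumed shuffling scheme)."""
    rng = np.random.default_rng(seed)
    observed = np.abs(X @ v)
    null = [np.abs(rng.permuted(X, axis=0) @ v) for _ in range(n_shuffles)]
    null = np.sort(np.concatenate(null))
    exceed = len(null) - np.searchsorted(null, observed, side="left")
    return (exceed + 1) / (len(null) + 1)

# Synthetic tensor (assumption): genes x samples x conditions, 20 planted genes.
rng = np.random.default_rng(1)
n_genes, n_samples, n_cond, n_signal = 1000, 6, 3, 20
tensor = rng.normal(size=(n_genes, n_samples, n_cond))
pattern = np.array([1, 1, 1, -1, -1, -1], dtype=float)  # two sample groups
tensor[:n_signal] += 4.0 * pattern[None, :, None]

# Mode-1 unfolding and its SVD; u1 is attributed to genes, v1 to sample/condition pairs.
X = tensor.reshape(n_genes, -1)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
u1, v1 = U[:, 0], Vt[0]

# Gaussian null with adjusted P < 0.01, shuffling null with adjusted P < 0.1.
sel_gauss = np.flatnonzero(bh_adjust(gaussian_pvalues(u1)) < 0.01)
sel_shuf = np.flatnonzero(bh_adjust(shuffle_pvalues(X, v1)) < 0.1)
overlap = np.intersect1d(sel_gauss, sel_shuf)
print(len(sel_gauss), len(sel_shuf), len(overlap))
```

Because `X @ v1` equals `S[0] * u1`, the projection statistic used for the shuffling null is proportional to the singular vector component used for the Gaussian null, which is why the two selections can be compared gene by gene. The exact counts depend on the random seed and on the assumed shuffling scheme.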
