PLoS One. 2022 Sep 29;17(9):e0275472.
doi: 10.1371/journal.pone.0275472. eCollection 2022.

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Y-H Taguchi et al. PLoS One.

Abstract

Identifying differentially expressed genes is difficult because the number of available samples is small compared with the number of genes. Conventional gene selection methods that employ statistical tests suffer from the critical problem that P-values depend heavily on sample size. Although the recently proposed principal component analysis (PCA)- and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why it works so well has been unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP), which was proposed long ago to address the problem of dimensionality: the space spanned by singular value vectors can be related to the space spanned by the optimal cluster centroids obtained from K-means. The success of PCA- and TD-based unsupervised FE can thus be understood through this equivalence. In addition, a threshold of 0.01 on adjusted P-values computed under the null hypothesis that the singular value vectors attributed to genes obey a Gaussian distribution empirically corresponds to a threshold of 0.1 on adjusted P-values when the null distribution is generated by shuffling the gene order. To show this, we applied PP to three data sets to which PCA- and TD-based unsupervised FE had previously been applied; these data sets cover two topics, biomarker identification for kidney cancers (the first two) and drug discovery for COVID-19 (the third). We found that the results of PP coincide well with those of PCA- or TD-based unsupervised FE. The shuffling procedure described above was also applied successfully to these three data sets. These findings rationalize the success of PCA- and TD-based unsupervised FE for the first time.
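The Gaussian-null selection step mentioned above can be illustrated with a minimal sketch. It assumes that one singular value vector attributed to genes has already been computed and is passed in as a plain list of per-gene scores. The function names, the standardization by the sample mean and standard deviation, and the use of the Benjamini-Hochberg procedure for P-value adjustment are all assumptions made for illustration; the abstract states only that adjusted P-values below 0.01 are used for selection.

```python
import math
import random


def gaussian_null_pvalues(scores):
    """P-values under the null that the scores obey a Gaussian distribution.

    Each score is standardized (assumption: sample mean and standard
    deviation) and its square is compared with a chi-squared distribution
    with one degree of freedom, whose upper tail equals erfc(|z| / sqrt(2)).
    """
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    if sd == 0.0:
        raise ValueError("scores have zero variance")
    return [math.erfc(abs(s - mean) / (sd * math.sqrt(2.0))) for s in scores]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (assumed adjustment method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_features(scores, threshold=0.01):
    """Indices of features whose adjusted P-value is below the threshold."""
    adjusted = benjamini_hochberg(gaussian_null_pvalues(scores))
    return [i for i, p in enumerate(adjusted) if p < threshold]


# Synthetic illustration only: 990 null "genes" plus 10 planted outliers.
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(1000)]
for i in range(10):
    scores[i] = 10.0
selected = select_features(scores, threshold=0.01)
print(len(selected), "features selected")
```

The synthetic scores stand in for a real singular value vector. In practice the vector would come from PCA or HOSVD applied to the expression data, and the threshold of 0.01 would be applied as described in the abstract.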

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Histogram of raw P-values computed using the null distribution generated by shuffling, for miRNAs in the first data set. (A) All miRNAs (B) Top 500 most expressive miRNAs.
Fig 2. Histogram of raw P-values computed using the null distribution generated by shuffling, for mRNAs in the first data set. (A) All mRNAs (B) Top 3000 most expressive mRNAs.
Fig 3. Histogram of raw P-values computed using the null distribution generated by shuffling, for genes in the third data set. (A) All genes (B) Top 2780 most expressive genes.
Fig 4. Q-Q plot between P-values computed by TD-based unsupervised FE and by projection. (A) mRNA in the first data set (B) miRNA in the first data set (C) mRNA in the second data set (D) miRNA in the second data set.
Fig 5. Q-Q plot of P-values between TD-based unsupervised FE and PP (the third data set).
Fig 6. Histogram of raw P-values computed using the null distribution generated by shuffling, for the second data set. (A) All miRNAs (B) All mRNAs.
Fig 7. Discussion of the workflow used in this study. Tensor decomposition (HOSVD) was applied to tensors, and P-values were attributed to genes using the obtained singular value vectors, which were assumed to obey a Gaussian distribution. The genes associated with adjusted P-values less than 0.01 were selected. P-values were also computed by shuffling, and the genes associated with adjusted P-values less than 0.1 coincided well with the genes selected by HOSVD. The correspondence between singular value vectors and K-means applied to unfolded matrices is also discussed.
Fig 8. Comparisons between yjkm and either v5(jkm) or u1ju2ku1m. Red straight lines indicate linear regressions.
