Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2023 Jun 6:4:e43665.
doi: 10.2196/43665.

Decision of the Optimal Rank of a Nonnegative Matrix Factorization Model for Gene Expression Data Sets Utilizing the Unit Invariant Knee Method: Development and Evaluation of the Elbow Method for Rank Selection

Affiliations

Decision of the Optimal Rank of a Nonnegative Matrix Factorization Model for Gene Expression Data Sets Utilizing the Unit Invariant Knee Method: Development and Evaluation of the Elbow Method for Rank Selection

Emine Guven. JMIR Bioinform Biotechnol. .

Abstract

Background: There is a great need to develop a computational approach to analyze and exploit the information contained in gene expression data. The recent utilization of nonnegative matrix factorization (NMF) in computational biology has demonstrated the capability to derive essential details from a high amount of data in particular gene expression microarrays. A common problem in NMF is finding the proper number rank (r) of factors of the degraded demonstration, but no agreement exists on which technique is most appropriate to utilize for this purpose. Thus, various techniques have been suggested to select the optimal value of rank factorization (r).

Objective: In this work, a new metric for rank selection is proposed based on the elbow method, which was methodically compared against the cophenetic metric.

Methods: To decide the optimum number rank (r), this study focused on the unit invariant knee (UIK) method of the NMF on gene expression data sets. Since the UIK method requires an extremum distance estimator that is eventually employed for inflection and identification of a knee point, the proposed method finds the first inflection point of the curvature of the residual sum of squares of the proposed algorithms using the UIK method on gene expression data sets as a target matrix.

Results: Computation was conducted for the UIK task using gene expression data of acute lymphoblastic leukemia and acute myeloid leukemia samples. Consequently, the distinct results of NMF were subjected to comparison on different algorithms. The proposed UIK method is easy to perform, fast, free of a priori rank value input, and does not require initial parameters that significantly influence the model's functionality.

Conclusions: This study demonstrates that the elbow method provides a credible prediction for both gene expression data and for precisely estimating simulated mutational processes data with known dimensions. The proposed UIK method is faster than conventional methods, including metrics utilizing the consensus matrix as a criterion for rank selection, while achieving significantly better computational efficiency without visual inspection on the curvatives. Finally, the suggested rank tuning method based on the elbow method for gene expression data is arguably theoretically superior to the cophenetic measure.

Keywords: consensus matrix; elbow method; gene expression data; nonnegative matrix factorization; optimal rank; rank factorization; unit invariant knee method.

PubMed Disclaimer

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Figure 1
(A) Rank survey plots for residual sum of squares (RSS) and (B) cophenetic coefficient curves factorization rank. The factorization rank ranges from 3 to 37. The aim is to decide whether the optimal rank factorization is very rigid by simple visual inspection. (C) The function of factorization rank is selected as the emergence rank of the RSS survey. The rank range between knee points is detected by the uik() function of the R package "inflection" at the curve of the cumulative rank units. The best fit is determined using a linear regression model.
Figure 2
Figure 2
Application of the unit invariant knee (UIK) method on different algorithms: (A) “Brunet” and (B) “nsNMF.” The optimal rank, which UIK represents, is 15 for the Brunet algorithm, whereas the UIK of the nsNMF algorithm reveals 14 as an optimum rank, similar to the “Lee” algorithm.
Figure 3
Figure 3
(A) Estimation of the optimal rank. Nonnegative matrix factorization (NMF) survey plot of quality measures obtained from factorization rank from 2 to 6 by running the target matrix esGolub [1:200] 10 times. (B) The function of factorization rank is selected as the emergence rank of the residual sum of squares (RSS) survey. For example, the rank range of 2 to 6 is between knee points detected by the R inflection package's uik()function at 3. Overall, the method of the UIK estimation was confirmed with former results.
Figure 4
Figure 4
(A) It is complicated to locate the optimal rank with the cophenetic correlation coefficient approach. (B) However, the unit invariant knee (UIK) method can facilitate this decision more quickly and more accurately, which agrees with the number of signatures detected by Alexandrov et al [46]. RSS: residual sum of squares.

References

    1. Laurberg H. Non-negative matrix factorization: theory and methods. PhD thesis. Institut for Elektroniske Systemer, Aalborg University Denmark. 2008. [2023-05-05]. https://vbn.aau.dk/ws/portalfiles/portal/316444854/HLA_thesis.pdf .
    1. Kameoka H, Ono N, Kashino K, Sagayama S. NMF: A new sparse representation for acoustic signals. IEEE International Conference on Acoustics, Speech and Signal Processing; April 19-24, 2009; Taipei, Taiwan. 2009. - DOI
    1. Cantisani G, Essid S, Richard G. Neuro-steered music source separation with EEG-based auditory attention decoding and contrastive-NMF. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 20211; June 6-11, 2021; Toronto, ON. 2021. - DOI
    1. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999 Oct 21;401(6755):788–791. doi: 10.1038/44565. - DOI - PubMed
    1. Ramanarayanan V, Katsamanis A, Narayanan S. Automatic data-driven learning of articulatory primitives from real-time mri data using convolutive nmf with sparseness constraints. INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association; August 27-31, 2011; Florence, Italy. 2011. - DOI

LinkOut - more resources