Stat Interface. 2024;17(2):219-230.
doi: 10.4310/23-sii790. Epub 2024 Feb 1.

Multi-way overlapping clustering by Bayesian tensor decomposition

Zhuofan Wang et al. Stat Interface. 2024.

Abstract

The development of modern sequencing technologies provides great opportunities to measure gene expression of multiple tissues from different individuals. The three-way variation across genes, tissues, and individuals makes statistical inference a challenging task. In this paper, we propose a Bayesian multi-way clustering approach to cluster genes, tissues, and individuals simultaneously. The proposed model adaptively trichotomizes the observed data into three latent categories and uses a Bayesian hierarchical construction to further decompose the latent variables into lower-dimensional features, which can be interpreted as overlapping clusters. With a Bayesian nonparametric prior, i.e., the Indian buffet process, our method determines the cluster number automatically. The utility of our approach is demonstrated through simulation studies and an application to the Genotype-Tissue Expression (GTEx) RNA-seq data. The clustering result reveals some interesting findings about depression-related genes in the human brain, which are also consistent with biological domain knowledge. The detailed algorithm and some numerical results are available in the online Supplementary Material at http://intlpress.com/site/pub/files/-supp/sii/2024/0017/0002/sii-2024-0017-0002-s001.pdf.
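
To illustrate how an Indian buffet process prior can leave the number of clusters open while allowing items to belong to several clusters at once, the following is a minimal sketch of a draw from the standard IBP prior. It is not the authors' BayMC sampler, which is described in the Supplementary Material; the function names `sample_poisson` and `sample_ibp` and the parameter `alpha` (the usual IBP concentration parameter) are assumptions made for this example, and the entries here are binary rather than the three-valued memberships shown in the figures.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) using Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_items, alpha, seed=0):
    """Draw a binary membership matrix from the Indian buffet process prior.

    Rows are items (for example genes) and columns are clusters.
    The number of columns is random, not fixed in advance.
    """
    rng = random.Random(seed)
    rows = []
    column_counts = []  # number of earlier items in each existing cluster
    for i in range(1, num_items + 1):
        # Join each existing cluster k with probability m_k / i.
        row = [1 if rng.random() < m / i else 0 for m in column_counts]
        # Open Poisson(alpha / i) new clusters that only this item belongs to so far.
        num_new = sample_poisson(alpha / i, rng)
        row.extend([1] * num_new)
        column_counts = [m + z for m, z in zip(column_counts, row)] + [1] * num_new
        rows.append(row)
    # Pad earlier rows with zeros so that every row has one entry per cluster.
    width = len(column_counts)
    return [row + [0] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    for row in sample_ibp(num_items=8, alpha=2.0, seed=1):
        print(row)
```

A row containing more than one 1 corresponds to an item with overlapping cluster membership, and larger values of `alpha` tend to produce more columns.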

Keywords: Bayesian nonparametric prior; Gene expression data; Indian buffet process; Low-rank tensor; Mixture model.

Mathematics Subject Classification: Primary 62H30; secondary 62F15.

Figures

Figure 1.
Average simulation results for BayMC over 50 replicates in the overlapped case. The green, black, and red cells represent 1, 0, and −1, respectively. First row: the true values of C1, C2, and C3. Second row: the average estimates of C1, C2, and C3 with complete observations. Third row: the average estimates of C1, C2, and C3 with 50% of the observations missing.
Figure 2.
The trace plot of the number of clusters for the proposed BayMC method.
Figure 3.
From left to right are the estimated membership matrices of tissues and genes using BayMC. The green, black, and red cells represent 1, 0, and −1, respectively. The membership matrix of donors is presented in Section S.3 of the Supplementary Material.
Figure 4.
From left to right are the estimated membership matrices of tissues and genes using HLloyd. The green and black cells represent 1 and 0, respectively.
Figure 5.
From left to right are the estimated membership matrices of tissues and genes using MultiCluster. The green and black cells represent 1 and 0, respectively.