Ann Appl Stat. 2023 Sep;17(3):2212-2235.
doi: 10.1214/22-aoas1715. Epub 2023 Sep 7.

Bayesian combinatorial MultiStudy factor analysis

Isabella N Grabski et al. Ann Appl Stat. 2023 Sep.

Abstract

Mutations in the BRCA1 and BRCA2 genes are known to be highly associated with breast cancer. Identifying both shared and unique transcript expression patterns in blood samples from these groups can shed light on whether and how the disease mechanisms differ among individuals by mutation status, but this is challenging in the high-dimensional setting. A recent method, Bayesian Multi-Study Factor Analysis (BMSFA), identifies latent factors common to all studies (or equivalently, groups) and latent factors specific to individual studies. However, BMSFA does not allow for factors shared by more than one but fewer than all studies. This is critical in our context, as we may expect some but not all signals to be shared by BRCA1- and BRCA2-mutation carriers but not necessarily other high-risk groups. We extend BMSFA by introducing a new method, Tetris, for Bayesian combinatorial multi-study factor analysis, which identifies latent factors that any combination of studies or groups can share. We model the subsets of studies that share latent factors with an Indian Buffet Process, and offer a way to summarize uncertainty in the sharing patterns using credible balls. We test our method with an extensive range of simulations, and showcase its utility not only in dimension reduction but also in covariance estimation. When applied to transcript expression data from high-risk families grouped by mutation status, Tetris reveals the features and pathways characterizing each group and the sharing patterns among them. Finally, we further extend Tetris to discover groupings of samples when group labels are not provided, which can elucidate additional structure in these data.
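
To make the idea of factors shared by arbitrary combinations of studies concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a binary study-by-factor sharing matrix (here fixed by hand rather than drawn from an Indian Buffet Process), a single loading matrix used by all studies, and isotropic Gaussian noise. The function name `simulate_studies` and all parameter names are hypothetical.

```python
import numpy as np

def simulate_studies(sharing, n_per_study, p, noise_sd=0.5, seed=0):
    """Simulate data where each study uses a subset of a common pool of factors.

    sharing: (n_studies, k) binary matrix; sharing[s, j] = 1 means study s
             carries factor j. This hand-specified matrix is an assumption
             standing in for a sharing pattern learned from data.
    """
    rng = np.random.default_rng(seed)
    sharing = np.asarray(sharing, dtype=float)
    n_studies, k = sharing.shape
    loadings = rng.normal(size=(p, k))  # one loading matrix for all studies

    data, covariances = [], []
    for s in range(n_studies):
        active = sharing[s]
        # Factor scores are zeroed out for factors the study does not carry.
        scores = rng.normal(size=(n_per_study, k)) * active
        noise = rng.normal(scale=noise_sd, size=(n_per_study, p))
        data.append(scores @ loadings.T + noise)
        # Implied covariance: loadings restricted to active factors, plus noise.
        covariances.append((loadings * active) @ loadings.T
                           + noise_sd ** 2 * np.eye(p))
    return loadings, data, covariances

# Factor 0 is common to all studies, factor 1 is shared by studies 0 and 1
# only, and factors 2 and 3 are specific to studies 2 and 0 respectively.
sharing = np.array([[1, 1, 0, 1],
                    [1, 1, 0, 0],
                    [1, 0, 1, 0]])
loadings, data, covariances = simulate_studies(sharing, n_per_study=50, p=6)
```

In this sketch, a row of ones in a column of `sharing` corresponds to a common factor, a single one to a study-specific factor, and anything in between to a partially shared factor, which is the case the abstract says BMSFA cannot represent.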

Keywords: Dimension Reduction; Factor Analysis; Gibbs Sampling; Multi-study Analysis; Unsupervised Learning.

Figures

FIG 1.
(A) Comparison of heatmaps for the true (left) and estimated (right) partially shared loading covariance (top) and common loading covariance (bottom). Results shown are based on a single dataset generated using the Scenario 1 simulation, where there are structural differences between the common and partially shared factors. (B) RV coefficients for the full loading matrix covariance and common loading matrix covariance for the Scenario 1 simulation. (C) Number of factors shared by each pair of studies i and j, indicated by (i,j), and the number of total factors belonging to study i, indicated by i. Estimated values are in black (with jitter, for visual clarity) and ground-truth values are in red.
FIG 2.
RV coefficients for the full loading matrix covariance (left) and common loading matrix covariance (right) across varying sparsity and data dimension in the Scenario 2 simulations.
FIG 3.
Number of factors shared by each pair of studies i and j, indicated by (i,j), and the number of total factors belonging to study i, indicated by i, for the pns ns simulations in Scenario 2, with varying sparsity and number of partially shared factors. Estimated values are in black (with jitter, for visual clarity) and ground-truth values are in red.
FIG 4.
RV coefficients for study-specific covariances across varying sparsities and numbers of partially shared factors in the pns setting of Scenario 2.
FIG 5.
(A) RV coefficients for full and common loading matrix covariances in the Scenario 4 simulations. (B) RV coefficients for the study-specific covariance matrices in the Scenario 4 simulations. (C) Number of factors shared by each pair of studies i and j, indicated by (i,j), and the number of total factors belonging to study i, indicated by i, for the Scenario 4 simulations. Estimated values are in black (with jitter, for visual clarity) and ground-truth values are in red.
FIG 6.
Visual summary of sharing pattern (top), factor loadings (middle), and pathway analysis (bottom) for the analysis by genotype. Each column corresponds to the same factor through the three panels. Transcripts (rows) in the heatmap of loadings are clustered by their TPMs across all samples. Enrichment p-values have been corrected with the Benjamini-Hochberg method. Pathway names are abbreviated, with an identifying table in Supplementary Materials Section F.
FIG 7.
Visual summary of sharing patterns (top), factor loadings (second), pathway analysis (third), and congruence coefficients with analysis by genotype (bottom) for the analysis by genotype and affected status. Each column corresponds to the same factor through the four panels. Transcripts (rows) in the heatmap of loadings are clustered by their raw counts across all samples. Enrichment p-values have been corrected with the Benjamini-Hochberg method. Pathway names are abbreviated, with an identifying table in Supplementary Materials Section F.
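
Several captions above report RV coefficients between true and estimated covariance matrices. As a rough illustration, the sketch below uses the common matrix form of the RV coefficient, trace(AB') divided by the square root of trace(AA') times trace(BB'). Whether the paper applies exactly this form is an assumption, and the function name `rv_coefficient` is hypothetical.

```python
import numpy as np

def rv_coefficient(a, b):
    """Similarity of two equally sized matrices.

    For positive semidefinite matrices such as covariances the value lies
    in [0, 1], and it equals 1 when one matrix is a positive multiple of
    the other.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    numerator = np.trace(a @ b.T)
    denominator = np.sqrt(np.trace(a @ a.T) * np.trace(b @ b.T))
    return numerator / denominator

# Toy loading matrix (illustrative values only) and its implied covariance.
phi_true = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
sigma_true = phi_true @ phi_true.T

# A perturbed "estimate" of the same covariance, kept symmetric.
rng = np.random.default_rng(1)
perturbation = rng.normal(scale=0.1, size=sigma_true.shape)
sigma_est = sigma_true + (perturbation + perturbation.T) / 2

rv_value = rv_coefficient(sigma_true, sigma_est)
```

Because the coefficient is invariant to rescaling either matrix, it measures agreement in covariance structure rather than in overall magnitude.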
