Causal Discovery in High-dimensional, Multicollinear Datasets
- PMID: 36778756
- PMCID: PMC9910507
- DOI: 10.3389/fepid.2022.899655
Abstract
As the cost of high-throughput genomic sequencing technology declines, its application in clinical research becomes increasingly popular. The collected datasets often contain tens or hundreds of thousands of biological features that need to be mined to extract meaningful information. One area of particular interest is discovering underlying causal mechanisms of disease outcomes. Over the past few decades, causal discovery algorithms have been developed and expanded to infer such relationships. However, these algorithms suffer from the curse of dimensionality and multicollinearity. A recently introduced, non-orthogonal, general empirical Bayes approach to matrix factorization has been demonstrated to successfully infer latent factors with interpretable structures from observed variables. We hypothesize that applying this strategy to causal discovery algorithms can solve both the high dimensionality and collinearity problems inherent to most biomedical datasets. We evaluate this strategy on simulated data and apply it to two real-world datasets. In a breast cancer dataset, we identified important survival-associated latent factors and biologically meaningful enriched pathways within factors related to important clinical features. In a SARS-CoV-2 dataset, we were able to predict whether a patient (1) had Covid-19 and (2) would enter the ICU. Furthermore, we were able to associate factors with known Covid-19 related biological pathways.
Keywords: Causal Discovery; Collinearity; Dimensionality Reduction; Empirical Bayes Matrix Factorization; Latent Factors.
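The general idea in the abstract, replacing many collinear features with a few latent factors before running causal discovery, can be illustrated with a minimal sketch. It makes several assumptions that are not from the article: the data are simulated with a hypothetical `simulate` helper, and a plain truncated SVD stands in for the empirical Bayes matrix factorization, which is a different, non-orthogonal method. The causal discovery step itself is not shown.

```python
import numpy as np

def max_abs_offdiag_corr(M):
    """Largest absolute pairwise correlation between distinct columns of M."""
    C = np.corrcoef(M, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return float(np.abs(C).max())

def simulate(n_samples=200, n_features=500, n_factors=5, noise=0.3, seed=0):
    """Hypothetical toy data: each feature loads on exactly one latent factor."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_samples, n_factors))           # latent factors
    assignment = rng.integers(0, n_factors, size=n_features)
    L = np.zeros((n_factors, n_features))                 # sparse loadings
    L[assignment, np.arange(n_features)] = rng.uniform(0.5, 1.5, size=n_features)
    return Z @ L + noise * rng.normal(size=(n_samples, n_features))

def svd_factors(X, k):
    """Truncated SVD of column-centered X (a stand-in, not EBMF)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]                       # scores, loadings

X = simulate()
scores, loadings = svd_factors(X, k=5)

feat_corr = max_abs_offdiag_corr(X)         # features sharing a factor are collinear
factor_corr = max_abs_offdiag_corr(scores)  # SVD scores are uncorrelated by construction

print(f"features: {X.shape[1]}, max |corr| = {feat_corr:.3f}")
print(f"factors:  {scores.shape[1]}, max |corr| = {factor_corr:.2e}")
```

In this sketch a downstream algorithm would see 5 uncorrelated columns instead of 500 strongly correlated ones. Because SVD forces orthogonal factors, it does not reproduce the interpretable, non-orthogonal structure that the abstract attributes to the empirical Bayes approach.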
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.