Front Epidemiol. 2022:2:899655.
doi: 10.3389/fepid.2022.899655. Epub 2022 Sep 13.

Causal Discovery in High-dimensional, Multicollinear Datasets

Minxue Jia et al. Front Epidemiol. 2022.

Abstract

As the cost of high-throughput genomic sequencing declines, its use in clinical research is becoming increasingly common. The collected datasets often contain tens or hundreds of thousands of biological features that must be mined to extract meaningful information. One area of particular interest is discovering the underlying causal mechanisms of disease outcomes. Over the past few decades, causal discovery algorithms have been developed and extended to infer such relationships. However, these algorithms suffer from the curse of dimensionality and from multicollinearity. A recently introduced, non-orthogonal, general empirical Bayes approach to matrix factorization has been shown to infer latent factors with interpretable structures from observed variables. We hypothesize that applying this strategy before causal discovery can address both the high dimensionality and the collinearity inherent to most biomedical datasets. We evaluate this strategy on simulated data and apply it to two real-world datasets. In a breast cancer dataset, we identified important survival-associated latent factors and biologically meaningful enriched pathways within factors related to important clinical features. In a SARS-CoV-2 dataset, we were able to predict whether a patient (1) had COVID-19 and (2) would enter the ICU. Furthermore, we were able to associate factors with known COVID-19-related biological pathways.
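The dimensionality-reduction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: EBMF is a non-orthogonal, empirical Bayes factorization, whereas this example swaps in a plain truncated SVD (an orthogonal factorization) purely to show how many collinear observed features can be replaced by a handful of factor scores before causal discovery. All function names, the simulated data, and the parameter values are assumptions made for illustration.

```python
import numpy as np


def mean_abs_offdiag_corr(M):
    """Mean absolute pairwise Pearson correlation between the columns of M."""
    C = np.corrcoef(M, rowvar=False)
    off_diagonal = C[~np.eye(C.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


def simulate(n_samples=200, n_features=500, n_factors=5, noise=0.5, seed=0):
    """Toy data: observed features are noisy linear mixtures of a few latent factors."""
    rng = np.random.default_rng(seed)
    F = rng.normal(size=(n_samples, n_factors))   # latent factor values
    L = rng.normal(size=(n_factors, n_features))  # loadings
    X = F @ L + noise * rng.normal(size=(n_samples, n_features))
    return X, F


def svd_factors(X, k):
    """Stand-in for EBMF (assumption): scores of the top-k SVD components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


X, F_true = simulate()
scores = svd_factors(X, k=5)

corr_features = mean_abs_offdiag_corr(X)       # collinearity among raw features
corr_factors = mean_abs_offdiag_corr(scores)   # collinearity among factor scores
print(f"features: {X.shape[1]}, mean |r| = {corr_features:.3f}")
print(f"factors:  {scores.shape[1]}, mean |r| = {corr_factors:.2e}")
```

In this toy setting the raw features are clearly correlated, while the SVD scores are uncorrelated up to floating-point error because they are orthogonal by construction. EBMF factors are not constrained to be orthogonal, so they need not be exactly uncorrelated; the sketch only conveys the idea of running causal discovery on a few factors instead of thousands of collinear features.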

Keywords: Causal Discovery; Collinearity; Dimensionality Reduction; Empirical Bayes Matrix Factorization; Latent Factors.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
(A) Regulation of gene expression. (B) Causal graph showing the relationships among observed features, latent factors, and target features.
Figure 2
Bar plots of (A) the number of latent factors and (B) the mean correlation coefficient (MCC). EBMF denotes latent factors recovered by the greedy algorithm; EBMF+BF denotes latent factors recovered by the greedy algorithm followed by the backfitting algorithm; PCA denotes principal component analysis; HC/LC denotes simulated data with high/low correlation. The line at 50 latent factors indicates the ground-truth number of source latent factors.
Figure 3
Markov blankets of important clinical features from the causal network learned using MGM-PC-Max. Blue diamonds are factors learned from METABRIC gene expression using greedy-backfitted EBMF. Green squares are clinical METABRIC features. For an explanation of the clinical features, see Supplementary Table 1.
Figure 4
Biological functions significantly associated with factors in the Markov blankets of tumor size (LF59), distant relapse (LF1 and LF10), ER status (LF2), and PR status (LF11 and LF27).
Figure 5
Disease-free survival based on (A) individual factors and (B) combinations of all factors.
Figure 6
Combined Markov blankets of clinical features from the causal network learned using FCI-Max with bootstrapping, highest ensemble, and α = 0.05. Blue diamonds are factors learned from gene expression using greedy-backfitted EBMF. Green squares are clinical features. For an explanation of the clinical features, see Supplementary Table 3.
Figure 7
Distributions of EBMF factor values for features in the Markov blanket of (A) disease state (COVID-19 vs. non-COVID-19) and (B) ICU admittance (yes vs. no).
Figure 8
Biological functions significantly associated with factors in the Markov blankets of disease state (LF2, LF5, LF20, LF36) and ICU admittance (LF3, LF23).
Figure 9
Intersection between biological function gene sets that are significantly associated with the differentially expressed genes and LFs across (A) disease state and (B) ICU admittance.
Figure 10
General linear model prediction ROC using factors contained within the Markov blanket for (A) disease state and (B) ICU admittance.
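
Several of the captions above refer to the Markov blanket of a clinical feature. As a minimal sketch of that concept, assuming a fully directed acyclic graph, the Markov blanket of a node is the union of its parents, its children, and the other parents of its children. The graphs in the paper are learned with MGM-PC-Max and FCI-Max, which can return partially directed or ancestral graphs, so this is an illustration of the definition rather than the authors' procedure. The function name, the edge list, and the node names are hypothetical.

```python
def markov_blanket(edges, target):
    """Markov blanket of `target` in a DAG given as (parent, child) pairs."""
    parents = {u for u, v in edges if v == target}
    children = {v for u, v in edges if u == target}
    # Spouses: other parents of the target's children.
    spouses = {u for u, v in edges if v in children and u != target}
    return parents | children | spouses


# Toy graph with made-up node names (not taken from the paper's results).
toy_edges = [
    ("factor_a", "outcome"),    # factor_a is a parent of outcome
    ("outcome", "followup"),    # followup is a child of outcome
    ("factor_b", "followup"),   # factor_b shares the child followup
    ("factor_c", "factor_a"),   # factor_c is two steps away, outside the blanket
]

blanket = markov_blanket(toy_edges, "outcome")
print(sorted(blanket))
```

Conditional on its Markov blanket, a node is independent of every other node in the graph, which is why the blanket is a natural feature set for predicting that node.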
