J Biomed Inform. 2023 Jun;142:104368.
doi: 10.1016/j.jbi.2023.104368. Epub 2023 Apr 21.

Causal feature selection using a knowledge graph combining structured knowledge from the biomedical literature and ontologies: A use case studying depression as a risk factor for Alzheimer's disease

Scott A Malec et al. J Biomed Inform. 2023 Jun.

Abstract

Background: Causal feature selection is essential for estimating effects from observational data. Identifying confounders is a crucial step in this process. Traditionally, researchers employ content-matter expertise and literature review to identify confounders. Uncontrolled confounding from unidentified confounders threatens validity, conditioning on intermediate variables (mediators) weakens estimates, and conditioning on common effects (colliders) induces bias. Additionally, without special treatment, erroneous conditioning on variables combining roles introduces bias. However, the vast literature is growing exponentially, making it infeasible to assimilate this knowledge. To address these challenges, we introduce a novel knowledge graph (KG) application enabling causal feature selection by combining computable literature-derived knowledge with biomedical ontologies. We present a use case of our approach specifying a causal model for estimating the total causal effect of depression on the risk of developing Alzheimer's disease (AD) from observational data.

Methods: We extracted computable knowledge from a literature corpus using three machine reading systems and inferred missing knowledge using logical closure operations. Using a KG framework, we mapped the output to target terminologies and combined it with ontology-grounded resources. We translated epidemiological definitions of confounder, collider, and mediator into queries for searching the KG and summarized the roles played by the identified variables. We compared the results with output from a complementary method and published observational studies and examined a selection of confounding and combined role variables in-depth.
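The role definitions above can be illustrated with a minimal sketch. It assumes a toy set of directed (cause, effect) edges in plain Python rather than the SPARQL queries run against the KG, and it only inspects direct (one-hop) edges. The variable names A, Y, X, M, and C and the helper functions are hypothetical and are not taken from the study's code.

```python
# Toy directed causal edges (cause, effect). All names are hypothetical.
EDGES = {
    ("X", "A"), ("X", "Y"),  # X is a common cause of exposure and outcome
    ("A", "M"), ("M", "Y"),  # M lies on a causal path from exposure to outcome
    ("A", "C"), ("Y", "C"),  # C is a common effect of exposure and outcome
    ("A", "Y"),              # direct effect of exposure on outcome
}


def causes(edges, node):
    """Return the direct causes (parents) of node."""
    return {src for src, dst in edges if dst == node}


def effects(edges, node):
    """Return the direct effects (children) of node."""
    return {dst for src, dst in edges if src == node}


def find_roles(edges, exposure, outcome):
    """Classify variables by one-hop causal role relative to exposure and outcome."""
    endpoints = {exposure, outcome}
    return {
        # Confounder: causes both the exposure and the outcome.
        "confounder": (causes(edges, exposure) & causes(edges, outcome)) - endpoints,
        # Mediator: caused by the exposure and causes the outcome.
        "mediator": (effects(edges, exposure) & causes(edges, outcome)) - endpoints,
        # Collider: caused by both the exposure and the outcome.
        "collider": (effects(edges, exposure) & effects(edges, outcome)) - endpoints,
    }


roles = find_roles(EDGES, exposure="A", outcome="Y")
print(roles)
```

A real KG search would also need to follow multi-step paths and typed predicates; this sketch only shows how each epidemiological definition reduces to a set intersection over incoming and outgoing edges.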

Results: Our search identified 128 confounders, including 58 phenotypes, 47 drugs, 35 genes, 23 collider, and 16 mediator phenotypes. However, only 31 of the 58 confounder phenotypes were found to behave exclusively as confounders, while the remaining 27 phenotypes played other roles. Obstructive sleep apnea emerged as a potential novel confounder for depression and AD. Anemia exemplified a variable playing combined roles.

Conclusion: Our findings suggest combining machine reading and KG could augment human expertise for causal feature selection. However, the complexity of causal feature selection for depression with AD highlights the need for standardized field-specific databases of causal variables. Further work is needed to optimize KG search and transform the output for human consumption.

Keywords: Alzheimer’s disease; Causal modeling; Depression; Feature selection; Knowledge graphs; Knowledge representation, management, or engineering.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1.
This causal diagram depicts exposure (A) and outcome (Y) in yellow and orange, respectively, along with a confounder denoted “X” with a green background, a mediator variable denoted “M” with a gray background, and a collider “C” also with a gray background. The green background for the confounder means that it is desirable to condition on such variables. Gray indicates that conditioning on that type of variable induces bias.
Figure 2.
The principal stages of the workflow to construct the causal feature selection system. I. Scope the literature to clinical studies investigating AD, published in 2010 or after. II. Obtain triples from the machine reading systems. III. Post-process the knowledge by performing logical closure operations with the CLIPS production rule system, to infer missing edges and terminology mapping. IV. Construct the KG using the PheKnowLator platform, merging the output of the machine reading systems with ontology-grounded information. V. Search the KG to identify relevant variables. VI. Compile the search results into a causal model analyzing KG search result output and comparing that output with the structured knowledge in SemMedDB, the results of a pilot meta-review of reported confounders collected from observational studies, and PubMed.
Figure 3.
This figure shows two versions of a SPARQL query’s (partial) output: (a.) sample raw output from the SPARQL query, and (b.) a visualization we manually created from that output. Note that the visualization displays hierarchical (IS_A) and mereological (from the Greek μερος, ‘part’; e.g., PART_OF, HAS_PART) relationships vertically, and causal relationships (INTERACTS_WITH, CAUSES_OR_CONTRIBUTES_TO_CONDITION) horizontally.
Figure 4.
This figure illustrates merging the outputs from the machine reading systems, performing knowledge hygiene, and applying logical closure operations on the extraction graph.
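As a rough illustration of what a closure operation over an extraction graph can look like, the sketch below computes a transitive closure over a toy edge set. This is an assumption made for illustration only: the study performs its closure operations with the CLIPS production rule system, and the specific rules it applies are not described here. The entity names are hypothetical, and treating a causal predicate as transitive is itself a modeling choice that may not hold for every relation.

```python
def transitive_closure(edges):
    """Return edges plus every edge implied by chaining (a, b) and (b, d) into (a, d)."""
    closure = set(edges)
    while True:
        inferred = {
            (a, d)
            for a, b in closure
            for c, d in closure
            if b == c and a != d  # chain matching edges, skipping self-loops
        }
        if inferred <= closure:  # fixed point: nothing new can be inferred
            return closure
        closure |= inferred


# Hypothetical extracted triples, reduced to (subject, object) pairs.
EXTRACTED = {
    ("gene_g", "process_p"),
    ("process_p", "condition_c"),
}

closed = transitive_closure(EXTRACTED)
print(sorted(closed - EXTRACTED))  # edges that were inferred rather than extracted
```

Iterating to a fixed point keeps the function short; for a large graph a dedicated graph library or a rule engine would be the more practical choice.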
Figure 5.
Sample reasoning pathway of inflammatory response, a confounder identified by KG search. Note that the y-axis is the TYPE_OF/subsumption axis. The x-axis is the CAUSES/INTERACTS_WITH axis. SOD2 is superoxide dismutase 2, a gene that encodes a mitochondrial protein. SOD2 is associated with diabetes and gastric cancer. SOD2 polymorphisms are associated with neurodegenerative diseases, including cognitive decline, stroke, and AD. The SOD2 protein scavenges reactive oxygen species (ROS) resulting from environmental exposures, e.g., bisphenol A, that result in inflammation. Bisphenol A is found ubiquitously in plastics and is ingested from water bottles, packaged food, and many other sources.
Figure 6.
Sample reasoning pathway of malnutrition, a collider identified by KG search. The MTHFR gene encodes the methylenetetrahydrofolate reductase (MTHFR) enzyme, which is critical for the metabolism of amino acids, including homocysteine. MTHFR perturbations and mutations decrease this metabolism and elevate homocysteine levels. Elevated homocysteine is linked with blood clots and thrombosis events. Homocysteine has been reported in the published literature as a confounder of depression and AD in at least one observational study. Elevated homocysteine is also associated with low levels of vitamins B6 and B12 and folate.
Figure 7.
Sample reasoning pathway of Parkinsonian disorders, a potential mediator for AD identified by KG search. Prodynorphin (PDYN) is a gene that produces endogenous opioid peptides involved in motor control and movement; these peptides are implicated in neurodegenerative diseases and modulate the response to psychoactive substances, including cocaine and ethanol. The gene insulin-like growth factor 2 (IGF2) is associated with development and growth. Perfluorooctanoic acid and bisphenol A exposure is associated with neurodegeneration resulting in AD. Hippocampal IGF2 expression is decreased in AD-diagnosed patients. Increased IGF2 expression improves cognition in mouse models of AD.
Figure 8.
This diagram shows the variables identified using their causal roles relating to depression and AD by translating standard epidemiological definitions into patterns for querying the KG.
Figure 9.
This diagram shows variables according to their roles, including single and hybrid role types for the relationship between depression (exposure) and AD (outcome). Each rectangular box in the figure lists all the conditions by single role (confounder, collider, or mediator) or combination roles (e.g., mediator/collider). For example, the box labeled “Confounders only” contains covariates that are exclusively confounders and were not found to act as a collider or a mediator.
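The grouping into single and combination roles can be sketched as a small partitioning step. The example below assumes the search has already produced one set of variables per role; the function name and the placeholder variables v1 to v5 are hypothetical and do not correspond to variables reported in the study.

```python
from collections import defaultdict


def group_by_role_combination(role_sets):
    """Map each role combination (e.g. 'collider/mediator') to its variables."""
    # First collect, for every variable, the full set of roles it plays.
    roles_of = defaultdict(set)
    for role, variables in role_sets.items():
        for variable in variables:
            roles_of[variable].add(role)
    # Then bucket variables by their sorted, slash-joined role combination.
    groups = defaultdict(set)
    for variable, roles in roles_of.items():
        groups["/".join(sorted(roles))].add(variable)
    return dict(groups)


# Hypothetical search results: placeholder variables, not study findings.
ROLE_SETS = {
    "confounder": {"v1", "v2", "v3"},
    "collider": {"v3", "v4"},
    "mediator": {"v2", "v3", "v5"},
}

groups = group_by_role_combination(ROLE_SETS)
for label in sorted(groups):
    print(label, sorted(groups[label]))
```

In this toy input, only v1 would land in a "confounders only" box; v2 and v3 play combined roles and would need the special treatment the abstract warns about before being conditioned on.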
Figure 10.
This figure shows the (mainly) condition variables/concepts for simple (not combined) causal roles from KG search in yellow, SemMedDBScoped in green diagonal, SemMedDBComplete in rose polka dot, and the confounders from the pilot meta-review in solid blue as a Venn diagram. See Appendix B, Figure S1, for the complete results, containing genes/enzymes/proteins and drugs/substances.
Figure 11.
This figure shows the (mainly) condition variables/concepts for combination causal roles from KG search in yellow, SemMedDBComplete in pink polka dot, and the confounders from the pilot meta-review in solid blue as a Venn diagram. See Appendix B, Figure S2, for the complete results, containing genes/enzymes/proteins and drugs/substances.
Figure 12.
This figure shows the reasoning path support for "depression causes anemia" in (a.) and for "AD causes anemia" in (b.), and shows the mediating role of IGF2 and AGT in the relationship between depression, AD, and anemia. Reasoning path support in the KG for anemia as a common effect or collider for depression and AD. Both reasoning paths on the left (a.) and (b.) show how IGF2 mediates between depression and environmental exposures (e.g., perfluorooctanoic acid and bisphenol A) on the one hand and anemia on the other. Angiotensinogen (AGT) is the name of a gene and enzyme coded by that gene involved in blood pressure and maintaining fluid and electrolyte homeostasis. Both reasoning paths on the right (a.) and (b.) show how AGT mediates between AD and the drugs atenolol (a beta-blocker), benazepril, and enalapril (the latter two are angiotensin-converting enzyme inhibitors, or ACEIs), on the one hand, and anemia on the other. These drugs are known to cause anemia. Numerous studies indicate AD as a potential cause of anemia.
Figure 13.
Support in the KG for anemia as a mediator for depression and AD. Note that (a.) contains the same information as Figure 12 (a.). Vascular endothelial growth factor A (VEGFA) promotes the proliferation and migration of vascular endothelial cells and is upregulated in tumorigenesis. VEGFA is also upregulated in AD.
