This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Oct 16:2025.08.11.666099.

doi: 10.1101/2025.08.11.666099.

The Data Distillery: A Graph Framework for Semantic Integration and Querying of Biomedical Data

Taha Mohseni Ahooyi¹, Benjamin Stear¹, J Alan Simmons², Vincent T Metzger³, Praveen Kumar³, John Erol Evangelista⁴, Daniel J B Clarke⁴, Zhuorui Xie⁴, Heesu Kim⁴, Sherry L Jenkins⁴, Mano R Maurya⁵, Srinivasan Ramachandran⁵, Eoin Fahy⁵, Thomas H Gillespie⁶, Fahim T Imam⁶, Natallia Kokash⁷, Matthew E Roth⁸, Robert Fullem⁸, Dubravka Jevtic⁹, Aleks Mihajlovic⁹, Michael Tiemeyer¹⁰, Clara Bakker¹¹, Andrew J Schroeder¹¹, Julia Markowski¹¹, Jared Nedzel¹², Dave D Hill¹, James Terry¹, Christopher Nemarich¹³, Jyl Boline¹⁴, Peter J Park¹¹, Kristin G Ardlie¹², Jeet Vora¹⁵, Raja Mazumder¹⁵, Rene Ranzinger¹⁰, Bernard de Bono¹⁶, Shankar Subramaniam⁵, Jeffrey S Grethe⁶, Jeremy J Yang³, Christophe G Lambert³, Adam Resnick¹³,¹⁷, Aleks Milosavljevic⁸, Avi Ma'ayan⁴, Jonathan C Silverstein², Deanne M Taylor¹,¹⁷

Affiliations

¹ Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia PA USA.
² Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh PA USA.
³ Department of Internal Medicine, Division of Translational Informatics, University of New Mexico Health Sciences Center, University of New Mexico NM USA.
⁴ Department of Pharmacological Sciences; Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York NY USA.
⁵ Department of Bioengineering, University of California San Diego, San Diego CA USA.
⁶ Department of Neuroscience, School of Medicine, University of California San Diego, San Diego CA USA.
⁷ Institute of Informatics, University of Amsterdam, the Netherlands.
⁸ Department of Molecular and Human Genetics, Baylor College of Medicine, Houston TX USA.
⁹ Persida Bio, Brooklyn NY USA.
¹⁰ Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia, USA.
¹¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
¹² Broad Institute of MIT and Harvard, Cambridge MA USA.
¹³ Center for Data Driven Discovery, The Children's Hospital of Philadelphia, Philadelphia PA USA.
¹⁴ Informed Minds Inc. Walnut Creek, CA USA.
¹⁵ Department of Biochemistry and Molecular Medicine, George Washington University, Washington DC USA.
¹⁶ Auckland Bioengineering Institute, University of Auckland, Auckland NZ.
¹⁷ Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia PA USA.

PMID: 40832351
PMCID: PMC12363844
DOI: 10.1101/2025.08.11.666099

Taha Mohseni Ahooyi et al. bioRxiv. 2025.

Abstract

The Data Distillery Knowledge Graph (DDKG) is a framework for semantic integration and querying of biomedical data across domains. Built for the NIH Common Fund Data Ecosystem, it supports translational research by linking clinical and experimental datasets in a unified graph model. Clinical standards such as ICD-10, SNOMED, and DrugBank are integrated through UMLS, while genomics and basic science data are structured using ontologies and standards such as HPO, GENCODE, Ensembl, STRING, and ClinVar. The DDKG uses a property graph architecture based on the UBKG infrastructure and supports ontology-based ingestion, identifier normalization, and graph-native querying. The system is modular and can be extended with new datasets or schema modules. We demonstrate its utility for informatics queries across eight use cases, including regulatory variant analysis, tissue-specific expression, biomarker discovery, and cross-species variant prioritization. The DDKG is accessible via a public interface, a programmatic API, and downloadable builds for local use.

Figures

**Figure 1. Overview of data sources integrated into the DDKG with the Unified Biomedical Knowledge Graph (UBKG).**
**(A)** The DDKG harmonizes diverse biomedical knowledge by incorporating over 180 vocabularies and data sources across clinical, genomic, and ontological domains from the UBKG. This includes: (left) the Unified Medical Language System (UMLS) with over 100 English-language terminologies; (center top) ontologies and controlled vocabularies from BioPortal, OBO Foundry, and GENCODE. **(B)** The DDKG extends the UBKG with data from annotation databases and datasets from NIH Common Fund Data Coordination Centers (DCCs), such as GTEx, HuBMAP, LINCS, Kids First, and others. These sources are harmonized through a property graph schema that supports semantic integration and query-driven exploration in the DDKG.

**Figure 2. Integration of chromatin loop and eQTL data using the DDKG.**
A single graph-native query integrates 3D chromatin conformation data from 4DN with GTEx eQTLs, enabling spatial analysis of regulatory variants without external preprocessing. **(A)** Distribution of chromatin loop sizes across 12 4DN datasets. **(B)** Example loop modeled in the DDKG with upstream and downstream anchors (red dashed circles) and an overlapping GTEx eQTL (bottom right). **(C)** Total GTEx eQTLs found in loop segments (upstream, intra-anchor, downstream). **(D)** Normalized frequency of eQTLs per segment, adjusted for segment length, showing enrichment at anchors.
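The length adjustment described for panel (D) can be illustrated with a minimal sketch. It is not the authors' query: the `Loop` fields, the half-open coordinate convention, the per-kilobase unit, and the reading of "intra-anchor" as the span between the two anchors are all assumptions made for illustration.

```python
from dataclasses import dataclass

SEGMENTS = ("upstream", "intra_anchor", "downstream")


@dataclass(frozen=True)
class Loop:
    """Hypothetical chromatin loop with two anchors (half-open coordinates)."""
    chrom: str
    up_start: int
    up_end: int
    down_start: int
    down_end: int


def classify(loop, chrom, pos):
    """Return the loop segment that contains a variant position, or None."""
    if chrom != loop.chrom:
        return None
    if loop.up_start <= pos < loop.up_end:
        return "upstream"
    if loop.up_end <= pos < loop.down_start:
        return "intra_anchor"
    if loop.down_start <= pos < loop.down_end:
        return "downstream"
    return None


def segment_lengths(loop):
    """Length in base pairs of each segment of one loop."""
    return {
        "upstream": loop.up_end - loop.up_start,
        "intra_anchor": loop.down_start - loop.up_end,
        "downstream": loop.down_end - loop.down_start,
    }


def eqtls_per_kb(loops, eqtls):
    """Count eQTLs per segment type and divide by total segment length (kb)."""
    counts = {seg: 0 for seg in SEGMENTS}
    lengths = {seg: 0 for seg in SEGMENTS}
    for loop in loops:
        for seg, length in segment_lengths(loop).items():
            lengths[seg] += length
        for chrom, pos in eqtls:
            seg = classify(loop, chrom, pos)
            if seg is not None:
                counts[seg] += 1
    return {
        seg: (counts[seg] * 1000 / lengths[seg]) if lengths[seg] else 0.0
        for seg in SEGMENTS
    }


# Toy input only; these coordinates are not taken from 4DN or GTEx.
toy_loops = [Loop("chr1", 1000, 2000, 10000, 11000)]
toy_eqtls = [("chr1", 1500), ("chr1", 1600), ("chr1", 5000), ("chr1", 10500)]
print(eqtls_per_kb(toy_loops, toy_eqtls))
```

Dividing by segment length matters because the region between anchors is usually far longer than the anchors themselves, so raw counts as in panel (C) would favor it even if eQTLs were uniformly distributed.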

**Figure 3.**
**(A)** An example graph view showing the connection between glycoenzyme genes and human tissues through GTEx expression of the glycosylation-associated genes provided by GlyGen (**Query 4**). **(B)** Heatmap of tissue-wide expression (log10 of the mean TPM) of 46 major glycoenzyme-encoding genes, illustrating tissue- and organ-specific variation.

**Figure 4. Joint querying of assertions defined by Metabolomics Workbench (MW) and IDG.**
**(A)** An example graph view of the output depicting the crosswalk between a human gene (MAOB), a metabolite, a condition (Ulcerative Colitis), and a tissue (Colon Structure), together with modulation of the same gene's protein product by a bioactive compound (see Online Methods, **Query 8**). Top 10 **(B)** conditions, **(C)** genes, **(D)** metabolites, and **(E)** conditions by frequency among the 450,000 output instances returned by the query.

**Figure 5.**
STRING enrichment analysis from Gene Ontology Biological Process for 27 genes identified from DDKG-based overlap between CHD cohort variants and IMPC phenotype-matched orthologs (**Method 7**). Results show significant overrepresentation in the 27 genes for developmental and signaling processes including different categories of cardiac development. FDR-adjusted p-values and gene set sizes are visualized using bar length and circle size, respectively. These results independently validate the biological relevance of DDKG-derived gene prioritizations in a cross-species phenotype-matching context.

**Figure 6. Visualization of gene-phenotype-disease associations using the DDKG-UI.**
The DDKG-UI, an interactive web-based platform derived from the Data Distillery Knowledge Graph (DDKG), enables users to explore complex biological relationships through customizable queries. (A) A query for the GFAP gene returns its expression across multiple tissues, anatomical abnormalities, and functional annotations. The graph illustrates associations with GTEx expression data, ENCODE regulatory elements, MSIGDB gene sets, and disease-related ontologies. (B) A disease-focused query links GFAP to vascular dementia, showing intermediary relationships with other genes (e.g., HTRA1 and NOTCH3) and relevant phenotypes, including abnormal myelination and muscle physiology alterations. The DDKG-UI allows researchers to dynamically search, filter, and visualize multi-omic datasets, facilitating hypothesis generation and biomedical discovery.

**Figure 7. Example output from Query 2, illustrating a targeted liquid biopsy workflow for Frontotemporal Dementia (HP:0002145) by leveraging exRNA detection in saliva.**
The visualization highlights key molecular interactions between disease-associated genes, RNA-binding proteins (RBPs), and extracellular RNA (exRNA) expression patterns across relevant biofluids. The DDKG query framework identifies genes linked to Frontotemporal Dementia, the biofluids where their exRNA is detected, and the RBPs predicted to interact with those exRNAs, enabling insights into disease biomarker discovery. This structured graph-based approach facilitates hypothesis generation, in this case for non-invasive biomarker detection in neurodegenerative disorders.

**Figure 8. Liquid Biopsy Analysis. Approach for monitoring drug response to Astemizole, focusing on ALCAM expression and PTBP1 detection in cerebrospinal fluid.**
This subgraph is a result from the Query 3 path connecting the antihistamine Astemizole (PUBCHEM:2247) to its transcriptomic targets, specifically highlighting the ALCAM gene. ALCAM is positively regulated in LINCS data following Astemizole perturbation. The ALCAM locus overlaps an exRNA region bound by the RNA-binding protein PTBP1. PTBP1 is computationally predicted to be present in cerebrospinal fluid (CSF) and interacts with the overlapping exRNA locus, making it accessible via CSF pulldown. This example demonstrates how the DDKG enables integration of pharmacogenomic, regulatory, and tissue-localization data to support hypothesis generation for drug monitoring via targeted liquid biopsy.

**Figure 9.**
Results for Query 5, which combines data from IDG and LINCS to find genes and PubChem compounds associated with the disease “Asthma.”

**Figure 10. Cypher query results for genes containing the string “ALOX” (Query 6).**
This query returns human lipoxygenase genes along with PubChem compounds associated via the bioactivity relationship.

**Figure 11. Graph-based representation of ALOX5 compound activity and expression profile.**
This example subgraph is one result from **Query 7**. It shows the connection from a PubChem compound with known bioactivity (from IDG) to its UniProt gene product, the HGNC-coded gene ALOX5. GTEx expression data is linked through expression bins and mapped to UBERON tissues (here Peyer’s patch). The result demonstrates the DDKG’s integration of chemical, molecular, and anatomical data, enabling rapid exploration of tissue-specific gene–compound interactions.

**Figure 12.**
**Query 9** identifies data points linked to evidence for genes in tissues across different omes in the MoTrPAC endurance-training exercise data from young adult rats that match GTEx eQTL regulation in heart. The resulting gene set comprises exercise-linked genes with tissue expression matches in humans that are linked to disease.

**Figure 13. Cross-species integration of mouse phenotypes with human genomic data using the DDKG.**
Mouse phenotypes associated with atrial septal defects (left) are linked to IMPC-derived gene knockouts, then mapped to human orthologs via HCOP. These human genes are associated with pathogenic variants in the Kids First congenital heart defect cohort (right). This multi-ontology traversal demonstrates how model organism data can inform human disease gene discovery through knowledge graph integration.

**Figure 14. STRING gene enrichment results on the Human Phenotype Ontology for the top 200 genes ranked by DDKG graph-based proximity to congenital heart defects and pediatric leukemias (Method 8).**
Genes prioritized using the Common Neighbors algorithm (**Query 11**) were analyzed using the STRING v12.0 functional annotation tool. The plot shows statistically significant signals for Leukemia and Atrial Septal Defects within the Human Phenotype Ontology (HPO) category.
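The Common Neighbors score used for this ranking is the number of nodes adjacent to both members of a pair. The sketch below shows the idea in plain Python on a toy undirected graph. It stands in for whatever graph tooling the authors ran and is not their implementation; the node names, the helper functions, and the tie-breaking rule are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def common_neighbors(adj, u, v):
    """Common Neighbors score: size of the intersection of the two neighborhoods."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def rank_by_common_neighbors(adj, target, candidates, top_k=200):
    """Rank candidates by shared neighbors with the target, highest first."""
    scored = [(c, common_neighbors(adj, target, c)) for c in candidates]
    # Ties are broken alphabetically so the ordering is deterministic.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Toy graph only; these nodes do not correspond to DDKG content.
toy_edges = [
    ("phenotype_X", "n1"), ("phenotype_X", "n2"), ("phenotype_X", "n3"),
    ("gene_A", "n1"), ("gene_A", "n2"),
    ("gene_B", "n3"),
    ("gene_C", "n4"),
]
toy_adj = build_adjacency(toy_edges)
print(rank_by_common_neighbors(toy_adj, "phenotype_X", ["gene_A", "gene_B", "gene_C"]))
```

The score is unnormalized, so highly connected nodes tend to rank higher simply because they have more neighbors to share.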

**Figure 15. Converting UBKG assertions into DDKG-UI assertions.**
The DDKG has limited properties on the Concept nodes. In this figure we show an example of a transformed node by extracting Code and Term information into properties on the concept nodes in the DDKG-UI subgraph.
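One way to picture this transformation is as a flattening step that folds the Code and Term information attached to each concept into properties of a single node. The following is a rough sketch under stated assumptions, not the DDKG-UI build code: the input row layout `(concept_id, code_id, term_type, term)`, the use of `"PT"` to mark a preferred term, and the output property names `codes`, `label`, and `synonyms` are all hypothetical.

```python
def flatten_concepts(rows):
    """Fold (concept_id, code_id, term_type, term) rows into one dict per concept.

    The first term marked "PT" becomes the node label; other distinct terms
    are kept as synonyms. All property names here are illustrative.
    """
    nodes = {}
    for concept_id, code_id, term_type, term in rows:
        node = nodes.setdefault(
            concept_id,
            {"id": concept_id, "codes": [], "label": None, "synonyms": []},
        )
        if code_id not in node["codes"]:
            node["codes"].append(code_id)
        if term_type == "PT" and node["label"] is None:
            node["label"] = term
        elif term != node["label"] and term not in node["synonyms"]:
            node["synonyms"].append(term)
    return nodes


# Toy rows only; identifiers are placeholders, not real UBKG codes.
toy_rows = [
    ("C001", "VOCAB_A:1", "PT", "alpha"),
    ("C001", "VOCAB_A:1", "SY", "a"),
    ("C001", "VOCAB_B:9", "PT", "alpha"),
]
print(flatten_concepts(toy_rows))
```

Holding codes and labels directly on the concept node lets a user interface display and search a node without traversing to separate code and term nodes.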
