[Preprint]. 2025 Oct 16:2025.08.11.666099.
doi: 10.1101/2025.08.11.666099.

The Data Distillery: A Graph Framework for Semantic Integration and Querying of Biomedical Data

Taha Mohseni Ahooyi et al. bioRxiv.

Abstract

The Data Distillery Knowledge Graph (DDKG) is a framework for semantic integration and querying of biomedical data across domains. Built for the NIH Common Fund Data Ecosystem, it supports translational research by linking clinical and experimental datasets in a unified graph model. Clinical standards such as ICD-10, SNOMED, and DrugBank are integrated through UMLS, while genomics and basic science data are structured using ontologies and standards such as HPO, GENCODE, Ensembl, STRING, and ClinVar. The DDKG uses a property graph architecture based on the UBKG infrastructure and supports ontology-based ingestion, identifier normalization, and graph-native querying. The system is modular and can be extended with new datasets or schema modules. We demonstrate its utility for informatics queries across eight use cases, including regulatory variant analysis, tissue-specific expression, biomarker discovery, and cross-species variant prioritization. The DDKG is accessible via a public interface, a programmatic API, and downloadable builds for local use.

Figures

Figure 1. Overview of data sources integrated into the DDKG with the Unified Biomedical Knowledge Graph (UBKG).
(A) The DDKG harmonizes diverse biomedical knowledge by incorporating over 180 vocabularies and data sources across clinical, genomic, and ontological domains from the UBKG. This includes: (left) the Unified Medical Language System (UMLS) with over 100 English-language terminologies; (center top) ontologies and controlled vocabularies from BioPortal, OBO Foundry, and GENCODE. (B) The DDKG extends the UBKG with data from annotation databases and datasets from NIH Common Fund Data Coordination Centers (DCCs), such as GTEx, HuBMAP, LINCS, Kids First, and others. These sources are harmonized through a property graph schema that supports semantic integration and query-driven exploration in the DDKG.
Figure 2. Integration of chromatin loop and eQTL data using the DDKG.
A single graph-native query integrates 3D chromatin conformation data from 4DN with GTEx eQTLs, enabling spatial analysis of regulatory variants without external preprocessing. (A) Distribution of chromatin loop sizes across 12 4DN datasets. (B) Example loop modeled in the DDKG with upstream and downstream anchors (red dashed circles) and an overlapping GTEx eQTL (bottom right). (C) Total GTEx eQTLs found in loop segments (upstream, intra-anchor, downstream). (D) Normalized frequency of eQTLs per segment, adjusted for segment length, showing enrichment at anchors.
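
The length adjustment described for panel D can be illustrated with a minimal sketch. It assumes half-open genomic intervals on a single chromosome and reads "upstream" and "downstream" as the two loop anchors and "intra-anchor" as the span between them; the function names and coordinates are hypothetical and this is not the DDKG query itself.

```python
from collections import Counter

def classify(pos, up, down):
    """Return the loop segment containing pos, or None if it lies outside the loop.

    up and down are hypothetical (start, end) half-open anchor intervals,
    with the upstream anchor located before the downstream anchor.
    """
    if up[0] <= pos < up[1]:
        return "upstream"
    if up[1] <= pos < down[0]:
        return "intra_anchor"
    if down[0] <= pos < down[1]:
        return "downstream"
    return None

def eqtl_density(positions, up, down):
    """Count eQTL positions per segment and divide by segment length (eQTLs per bp)."""
    segments = (classify(p, up, down) for p in positions)
    counts = Counter(s for s in segments if s is not None)
    lengths = {
        "upstream": up[1] - up[0],
        "intra_anchor": down[0] - up[1],
        "downstream": down[1] - down[0],
    }
    return {seg: counts[seg] / length for seg, length in lengths.items() if length > 0}

# Toy coordinates, not real data: two 100 bp anchors separated by 1,700 bp.
density = eqtl_density([100, 150, 500, 1950], up=(100, 200), down=(1900, 2000))
print(density)
```

Dividing by segment length matters because the span between anchors is usually far longer than the anchors themselves, so raw counts alone would overstate its eQTL content.
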
Figure 3.
(A) An example graph view connecting glycoenzyme genes to human tissues through GTEx expression of glycosylation-associated genes provided by GlyGen (Query 4). (B) Heatmap of tissue-wide expression (log10 of mean TPM) for 46 major glycoenzyme-encoding genes, illustrating tissue- and organ-specific variation.
Figure 4. Joint querying of assertions defined by Metabolomics Workbench (MW) and IDG.
(A) An example graph view of the output depicting the crosswalk between a human gene (MAOB), a metabolite, a condition (Ulcerative Colitis), and a tissue (Colon Structure), together with modulation of the same gene's protein product by a bioactive compound (see Online Methods, Query 8). Top 10 (B) conditions, (C) genes, (D) metabolites, and (E) conditions by frequency among the 450,000 output instances returned by the query.
Figure 5.
STRING enrichment analysis of Gene Ontology Biological Process terms for 27 genes identified from the DDKG-based overlap between CHD cohort variants and IMPC phenotype-matched orthologs (Method 7). Developmental and signaling processes, including several categories of cardiac development, are significantly overrepresented among the 27 genes. FDR-adjusted p-values and gene set sizes are visualized using bar length and circle size, respectively. These results independently validate the biological relevance of DDKG-derived gene prioritizations in a cross-species phenotype-matching context.
Figure 6. Visualization of gene-phenotype-disease associations using the DDKG-UI.
The DDKG-UI, an interactive web-based platform derived from the Data Distillery Knowledge Graph (DDKG), enables users to explore complex biological relationships through customizable queries. (A) A query for the GFAP gene returns its expression across multiple tissues, anatomical abnormalities, and functional annotations. The graph illustrates associations with GTEx expression data, ENCODE regulatory elements, MSIGDB gene sets, and disease-related ontologies. (B) A disease-focused query links GFAP to vascular dementia, showing intermediary relationships with other genes (e.g., HTRA1 and NOTCH3) and relevant phenotypes, including abnormal myelination and muscle physiology alterations. The DDKG-UI allows researchers to dynamically search, filter, and visualize multi-omic datasets, facilitating hypothesis generation and biomedical discovery.
Figure 7. Example output from Query 2, illustrating a targeted liquid biopsy workflow for Frontotemporal Dementia (HP:0002145) by leveraging exRNA detection in saliva.
The visualization highlights key molecular interactions between disease-associated genes, RNA-binding proteins (RBPs), and extracellular RNA (exRNA) expression patterns across relevant biofluids. The DDKG query framework identifies genes linked to Frontotemporal Dementia, the biofluids where their exRNA is detected, and the RBPs predicted to interact with those exRNAs, enabling insights into disease biomarker discovery. This structured graph-based approach facilitates hypothesis generation, in this case for non-invasive biomarker detection in neurodegenerative disorders.
Figure 8. Liquid Biopsy Analysis. Approach for monitoring drug response to Astemizole, focusing on ALCAM expression and PTBP1 detection in cerebrospinal fluid.
This subgraph is a result from the Query 3 path connecting the antihistamine Astemizole (PUBCHEM:2247) to its transcriptomic targets, specifically highlighting the ALCAM gene. ALCAM is positively regulated in LINCS data following Astemizole perturbation. The ALCAM locus overlaps an exRNA region bound by the RNA-binding protein PTBP1. PTBP1 is computationally predicted to be present in cerebrospinal fluid (CSF) and interacts with the overlapping exRNA locus, making it accessible via CSF pulldown. This example demonstrates how the DDKG enables integration of pharmacogenomic, regulatory, and tissue-localization data to support hypothesis generation for drug monitoring via targeted liquid biopsy.
Figure 9.
Results for Query 5, which combines data from IDG and LINCS to find genes and PubChem compounds associated with the disease “Asthma.”
Figure 10. Cypher query results for genes containing the string “ALOX” (Query 6).
This query returns human lipoxygenase genes along with PubChem compounds associated via the bioactivity relationship.
Figure 11. Graph-based representation of ALOX5 compound activity and expression profile.
This example subgraph is one result from Query 7. It shows the connection from a PubChem compound with known bioactivity (from IDG) to its UniProt gene product, the HGNC-coded gene ALOX5. GTEx expression data is linked through expression bins and mapped to UBERON tissues (here Peyer’s patch). The result demonstrates the DDKG’s integration of chemical, molecular, and anatomical data, enabling rapid exploration of tissue-specific gene–compound interactions.
Figure 12.
Query 9 identifies data points linked to gene-level evidence in tissues, across different omes, in the MoTrPAC endurance training data from young adult rats that match GTEx eQTL regulation in heart. The resulting gene set comprises exercise-linked genes that have matching tissue expression in humans and are linked to disease.
Figure 13. Cross-species integration of mouse phenotypes with human genomic data using the DDKG.
Mouse phenotypes associated with atrial septal defects (left) are linked to IMPC-derived gene knockouts, then mapped to human orthologs via HCOP. These human genes are associated with pathogenic variants in the Kids First congenital heart defect cohort (right). This multi-ontology traversal demonstrates how model organism data can inform human disease gene discovery through knowledge graph integration.
Figure 14. STRING gene enrichment results on the Human Phenotype Ontology for the top 200 genes ranked by DDKG graph-based proximity to congenital heart defects and pediatric leukemias (Method 8).
Genes prioritized using the Common Neighbors algorithm (Query 11) were analyzed using the STRING v12.0 functional annotation tool. The plot shows statistically significant signals for Leukemia and Atrial Septal Defects within the Human Phenotype Ontology (HPO) category.
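
For readers unfamiliar with it, the Common Neighbors score between two nodes is the number of neighbors they share. The sketch below ranks candidate nodes by that score against a target node on a toy adjacency map. It is only an illustration of the scoring idea: the node names, the dictionary representation, and the tie-breaking rule are assumptions, and it does not reproduce Query 11.

```python
def common_neighbors(adj, a, b):
    """Number of nodes adjacent to both a and b in an undirected graph."""
    return len(adj.get(a, set()) & adj.get(b, set()))

def rank_by_common_neighbors(adj, target, candidates, top_n=200):
    """Score each candidate against target and return the top_n (node, score) pairs.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    scored = [(c, common_neighbors(adj, target, c)) for c in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Toy graph with hypothetical node names, not DDKG content.
adj = {
    "phenotype": {"n1", "n2", "n3"},
    "gene_a": {"n1", "n2"},
    "gene_b": {"n3"},
    "gene_c": {"n9"},
}
ranking = rank_by_common_neighbors(adj, "phenotype", ["gene_a", "gene_b", "gene_c"])
print(ranking)
```
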
Figure 15. Converting UBKG assertions into DDKG-UI assertions.
The DDKG has limited properties on the Concept nodes. In this figure we show an example of a transformed node by extracting Code and Term information into properties on the concept nodes in the DDKG-UI subgraph.
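
The kind of transformation described in this caption can be sketched as folding the codes and terms attached to a concept into properties on a single node record. The field names (`id`, `codes`, `label`, `synonyms`), the choice of the first term as the label, and the identifiers are all hypothetical; the actual DDKG-UI property names are not given here.

```python
def flatten_concept(concept_id, codes, terms):
    """Fold code and term information into properties on one concept record.

    codes: iterable of code identifiers attached to the concept.
    terms: list of term strings, with the preferred term assumed to come first.
    """
    terms = list(terms)
    return {
        "id": concept_id,
        "codes": sorted(codes),
        "label": terms[0] if terms else None,
        "synonyms": terms[1:],
    }

# Placeholder identifiers and terms, for illustration only.
node = flatten_concept(
    "CONCEPT:0001",
    codes={"SAB_B:22", "SAB_A:11"},
    terms=["example preferred term", "example synonym"],
)
print(node)
```
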
