Sci Data. 2023 Feb 2;10(1):67.
doi: 10.1038/s41597-023-01960-3.

Building a knowledge graph to enable precision medicine

Payal Chandak et al. Sci Data. 2023.

Abstract

Developing personalized diagnostic strategies and targeted treatments requires a deep understanding of disease biology and the ability to dissect the relationship between molecular and genetic factors and their phenotypic consequences. However, such knowledge is fragmented across publications, non-standardized repositories, and evolving ontologies describing various scales of biological organization between genotypes and clinical phenotypes. Here, we present PrimeKG, a multimodal knowledge graph for precision medicine analyses. PrimeKG integrates 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action, considerably expanding previous efforts in disease-rooted knowledge graphs. PrimeKG contains an abundance of 'indication', 'contraindication', and 'off-label use' drug-disease edges that are lacking in other knowledge graphs and can support AI analyses of how drugs affect disease-associated networks. We supplement PrimeKG's graph structure with language descriptions of clinical guidelines to enable multimodal analyses and provide instructions for continual updates of PrimeKG as new data become available.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Overview of PrimeKG multimodal knowledge graph. (a) Shown is a schematic overview of the various types of nodes in PrimeKG and the relationships they have with other nodes in the graph. (b) All disease nodes in PrimeKG shown in a circular layout together with disease-associated information. All relationships between disease nodes and any other node type are depicted here. Disease nodes are densely connected to four other node types in PrimeKG through seven types of relations. (c) Shown is an example of paths in PrimeKG between the disease node ‘Autism’ and the drug node ‘Risperidone’. Intermediate nodes are colored by their node type from panel a. We also display snippets of text features for both nodes to demonstrate the multimodality of PrimeKG. Abbreviations - MF: molecular function, BP: biological process, CC: cellular component, APZ: Aripiprazole, EPI: epilepsy, ABP: abdominal pain, +/− associations: positive and negative associations.
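
The kind of disease-to-drug path described in panel c can be sketched with a breadth-first search over typed edges. This is a minimal illustration only: the node identifiers, relation names, and edges below are hypothetical placeholders, not records from PrimeKG, and the helper functions are assumptions made for the example rather than part of any PrimeKG tooling.

```python
from collections import deque

# Hypothetical toy edges (node, relation, node). These are placeholders
# for illustration and are NOT taken from PrimeKG.
EDGES = [
    ("disease:D1", "associated_with", "gene:G1"),
    ("disease:D1", "phenotype_present", "phenotype:P1"),
    ("gene:G1", "target", "drug:R1"),
    ("gene:G1", "interacts_with", "gene:G2"),
    ("gene:G2", "target", "drug:R1"),
    ("drug:R1", "side_effect", "phenotype:P1"),
]


def build_adjacency(edges):
    """Index edges in both directions so the graph is treated as undirected."""
    adjacency = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))
        adjacency.setdefault(target, []).append((relation, source))
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search.

    Returns a list alternating node, relation, node, ... for one shortest
    path from start to goal, or None if the two nodes are not connected.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [relation, neighbor])
    return None


ADJ = build_adjacency(EDGES)
PATH = shortest_path(ADJ, "disease:D1", "drug:R1")
print(" -> ".join(PATH))
```

Intermediate nodes on the returned path carry their type in the identifier prefix, which mirrors how the figure colors intermediate nodes by node type.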
Fig. 2
Building PrimeKG. The panels sequentially illustrate the process of developing the Precision Medicine Knowledge Graph. (a) Shown are 20 primary data resources curated to develop PrimeKG. The colors highlight which data records are used to uniquely identify each node type. For example, GO is colored by biological processes, cellular components, and molecular functions because GO terms are the unique identifiers used to define nodes for these three node types. (b) Primary resources are colored by each node type for which they possess information. For example, GO provides links from biological processes, cellular components, and molecular functions to genes. As a result, we add the fourth color to represent the gene/protein class. (c) Illustrated is the process of harmonizing these primary data records to extract relationships between node types. (d) The left side illustrates PrimeKG, and the right side shows all the textual sources of clinical information on drugs and diseases. The node type legend is consistent across the figure. Abbreviations - MF: molecular function, BP: biological process, CC: cellular component, PPI: protein-protein interactions, DO: disease ontology, MONDO: MONDO disease ontology, Entrez: Entrez gene, GO: gene ontology, UMLS: unified medical language system, HPO: human phenotype ontology, CTD: comparative toxicogenomics database, SIDER: side effect resource.
Fig. 3
Reconciling autism disease nodes into clinically relevant entities. (a) The left side shows three clinically determined subtypes of autism. The right side shows autism-related disease terms across three ontologies: MONDO, UMLS, and Orphanet. While we can identify mappings across the ontologies, it is unclear how the terms in any ontology connect to clinical subtypes. (b) Illustration of how we use a language model, ClinicalBERT, to map terms from MONDO into a latent embedding space. Because the language model can group synonyms in the embedding space, we can cluster MONDO terms with similar semantic and medical meanings by calculating cosine similarity between embeddings of disease concepts. These clusters are created to develop disease groupings, as shown on the right in panel b. Abbreviations - MONDO: MONDO disease ontology, UMLS: unified medical language system.
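
The grouping step in panel b can be approximated as thresholding pairwise cosine similarity and merging connected terms. The sketch below assumes this simple single-linkage rule; the captions do not state the exact clustering procedure or threshold used by the authors. The three-dimensional vectors, term names, and the 0.9 threshold are hypothetical stand-ins for real ClinicalBERT embeddings of MONDO terms.

```python
import math

# Hypothetical low-dimensional vectors standing in for language-model
# embeddings of disease terms. Names and values are illustrative only.
EMBEDDINGS = {
    "term_a": [0.9, 0.1, 0.0],
    "term_b": [0.8, 0.2, 0.1],
    "term_c": [0.0, 0.1, 0.9],
    "term_d": [0.1, 0.0, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_by_similarity(embeddings, threshold):
    """Merge terms whose pairwise cosine similarity meets the threshold.

    Uses a small union-find structure, so groups are the connected
    components of the "similar enough" relation (single linkage).
    """
    parent = {name: name for name in embeddings}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    names = sorted(embeddings)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            similarity = cosine_similarity(embeddings[first], embeddings[second])
            if similarity >= threshold:
                parent[find(second)] = find(first)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# The threshold is an assumed value chosen for this toy data.
GROUPS = group_by_similarity(EMBEDDINGS, threshold=0.9)
print(GROUPS)
```

With real embeddings the threshold would need tuning against clinically reviewed groupings, since a value that is too low merges unrelated diseases and one that is too high leaves synonyms apart.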
