[Preprint]. 2020 Aug 18:2020.08.17.254839.
doi: 10.1101/2020.08.17.254839.

KG-COVID-19: a framework to produce customized knowledge graphs for COVID-19 response

Justin Reese et al. bioRxiv. 2020.

Update in

  • KG-COVID-19: A Framework to Produce Customized Knowledge Graphs for COVID-19 Response.
    Reese JT, Unni D, Callahan TJ, Cappelletti L, Ravanmehr V, Carbon S, Shefchek KA, Good BM, Balhoff JP, Fontana T, Blau H, Matentzoglu N, Harris NL, Munoz-Torres MC, Haendel MA, Robinson PN, Joachimiak MP, Mungall CJ. Patterns (N Y). 2021 Jan 8;2(1):100155. doi: 10.1016/j.patter.2020.100155. Epub 2020 Nov 9. PMID: 33196056. Free PMC article.

Abstract

Integrated, up-to-date data about SARS-CoV-2 and coronavirus disease 2019 (COVID-19) is crucial for the biomedical research community's ongoing response to the COVID-19 pandemic. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time-consuming, since much of it is held in siloed databases or in textual format. Furthermore, the data required by the research community varies drastically between tasks: the optimal data for a machine learning task, for example, differs substantially from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates biomedical data to produce knowledge graphs (KGs) for COVID-19 response. This KG framework can also be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics.

Bigger picture: An effective response to the COVID-19 pandemic relies on integrating the many different types of data available about SARS-CoV-2 and related viruses. KG-COVID-19 is a framework for producing knowledge graphs that can be customized for downstream applications, including machine learning tasks, hypothesis-based querying, and browsable user interfaces that enable researchers to explore COVID-19 data and discover relationships.

Conflict of interest statement

DECLARATION OF INTERESTS

The authors declare no competing interests.

Figures

Figure 1.
The KG-COVID-19 framework for producing KGs. The framework is divided into three modular steps: download, transform, and merge. A) The download step retrieves all data sets needed for ingestion using a set of URLs specified in a YAML file. B) The transform step applies Python code that is specific to each source to transform the most useful elements of each source and emit a graph in TSV format. C) The merge step uses a YAML file to read the user-specified data sets (among those produced in the transform step) and merge them into a single KG. Different YAML files can be constructed to mix and match different input data from B, but each merge operation yields a single merged graph. Both the transform and merge steps rely heavily on KGX, a powerful tool for manipulating knowledge graphs (https://github.com/NCATS-Tangerine/kgx).
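The mix-and-match behavior of the merge step can be pictured with a minimal Python sketch. It does not use KGX and is not the framework's own code: the source names, the `merge_config` dictionary (standing in for a parsed YAML file), and the `merge_graphs` helper are all hypothetical, chosen only to show how one configuration selects a subset of transformed sources and yields a single merged graph.

```python
from typing import Dict, List, Tuple

Node = Dict[str, str]
Edge = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical per-source graphs, standing in for the output of the transform step.
transformed: Dict[str, Tuple[List[Node], List[Edge]]] = {
    "source_a": (
        [{"id": "EX:1"}, {"id": "EX:2"}],
        [("EX:1", "related_to", "EX:2")],
    ),
    "source_b": (
        [{"id": "EX:2"}, {"id": "EX:3"}],
        [("EX:2", "related_to", "EX:3"), ("EX:1", "related_to", "EX:2")],
    ),
    "source_c": ([{"id": "EX:4"}], []),
}

# Hypothetical stand-in for a parsed merge YAML file: it only lists the sources to combine.
merge_config = {"sources": ["source_a", "source_b"]}


def merge_graphs(graphs, selected):
    """Combine the selected sources into one graph, dropping duplicate nodes and edges."""
    nodes: Dict[str, Node] = {}
    edges: List[Edge] = []
    seen_edges = set()
    for name in selected:
        source_nodes, source_edges = graphs[name]
        for node in source_nodes:
            # Keep the first description seen for each node identifier.
            nodes.setdefault(node["id"], node)
        for edge in source_edges:
            if edge not in seen_edges:
                seen_edges.add(edge)
                edges.append(edge)
    return nodes, edges


merged_nodes, merged_edges = merge_graphs(transformed, merge_config["sources"])
```

A different configuration, for example one listing only `source_c`, would produce a different single graph from the same transformed inputs.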
Figure 2.
A typical transformation of records from an input file into entries in nodes.tsv and edges.tsv files representing the nodes and edge of a graph. These nodes and the edge can be further transformed into RDF triples.
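As a rough illustration of this kind of transformation, the sketch below turns one input record into two node rows and one edge row and serializes them as TSV. The record fields, the `EX:` identifiers, the column names, and the chosen Biolink category and predicate strings are assumptions made for the example, not the columns or mappings used by KG-COVID-19.

```python
import csv
import io

# Assumed column layouts for the two output files.
NODE_COLUMNS = ["id", "name", "category"]
EDGE_COLUMNS = ["subject", "predicate", "object"]


def record_to_graph(record):
    """Map one hypothetical drug-target record to two nodes and one edge."""
    nodes = [
        {"id": record["drug_id"], "name": record["drug_name"], "category": "biolink:Drug"},
        {"id": record["protein_id"], "name": record["protein_name"], "category": "biolink:Protein"},
    ]
    edge = {
        "subject": record["drug_id"],
        "predicate": "biolink:interacts_with",
        "object": record["protein_id"],
    }
    return nodes, edge


def to_tsv(rows, columns):
    """Serialize a list of row dictionaries as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical input record; the identifiers are placeholders.
record = {
    "drug_id": "EX:drug1",
    "drug_name": "example drug",
    "protein_id": "EX:protein1",
    "protein_name": "example protein",
}

nodes, edge = record_to_graph(record)
nodes_tsv = to_tsv(nodes, NODE_COLUMNS)
edges_tsv = to_tsv([edge], EDGE_COLUMNS)
```

Writing `nodes_tsv` and `edges_tsv` to disk would give the two files described in the caption, one row per node and one row per edge.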
Figure 3.
Schematic representation of the data currently ingested into the KG-COVID-19 knowledge graph. (Top) Polygons correspond to the data sources currently ingested into the KG, and the small colored circles indicate the data types ingested from each source. (Bottom) Sankey plot showing the Biolink categories for edges in the KG-COVID-19 graph. Left and middle columns show Biolink categories for edges; the right column indicates the source of the data from which the edges were derived. Line widths are proportional to the number of edges.
Figure 4.
Workflow for machine learning application of the KG-COVID-19 knowledge graph. A. To train classifiers for use in link prediction, training and test graphs are first produced from the original KG-COVID-19 graph (see Experimental Procedures). These graphs are used by Embiggen to generate random walks, embeddings, and finally a classifier. The test graphs are used to assess the performance of the classifier. This step is performed iteratively in order to identify optimal hyperparameters. B. The classifiers are applied to the KG-COVID-19 graph to perform link prediction in order to identify links that correspond to actionable knowledge: for example, links between drugs and the COVID-19 disease, links between drugs and SARS-CoV-2 protein targets, and links between drugs and host proteins that are involved in COVID-19 disease processes.
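One simple way to picture the production of training and test graphs is an edge holdout, sketched below. This is an assumption made for illustration: the actual procedure is the one given in Experimental Procedures, and this sketch does not use Embiggen or preserve graph connectivity. The `split_edges` helper, the 20% test fraction, and the toy edges are all hypothetical.

```python
import random


def split_edges(edges, test_fraction=0.2, seed=0):
    """Hold out a random fraction of edges for testing; the rest form the training graph."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy edge list with placeholder identifiers.
toy_edges = [(f"EX:{i}", "related_to", f"EX:{i + 1}") for i in range(10)]
train_edges, test_edges = split_edges(toy_edges)
```

A classifier trained on embeddings of the training graph can then be scored on whether it recovers the held-out test edges.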
Figure 5.
Hypothesis-based querying of the KG-COVID-19 knowledge graph using SPARQL queries. (Top) A SPARQL query retrieves approved drugs that target human proteins that physically interact with a SARS-CoV-2 protein. (Bottom) A SPARQL query retrieves approved drugs that target human proteins that interact indirectly with SARS-CoV-2 through another human protein. The suitability of these drugs for repositioning is evaluated by NVBL collaborators, for example by analyzing available structural data to support repositioning.
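The logic of the two queries can be sketched in plain Python over a list of edges. This is not SPARQL and not the queries shown in the figure: the predicate names `interacts_with` and `targets`, the helper functions, and every identifier in the toy graph are hypothetical, and the sketch only illustrates the direct (one-hop) and indirect (two-hop) patterns.

```python
def neighbors(edges, node, predicate):
    """Return nodes linked to `node` by `predicate`, treating the edge as undirected."""
    found = set()
    for subject, pred, obj in edges:
        if pred != predicate:
            continue
        if subject == node:
            found.add(obj)
        elif obj == node:
            found.add(subject)
    return found


def approved_drugs_targeting(edges, proteins, approved):
    """Return approved drugs with a `targets` edge to any of the given proteins."""
    return {s for s, p, o in edges if p == "targets" and o in proteins and s in approved}


def direct_candidates(edges, viral_protein, human_proteins, approved):
    """Approved drugs targeting human proteins that interact with the viral protein."""
    hosts = neighbors(edges, viral_protein, "interacts_with") & human_proteins
    return approved_drugs_targeting(edges, hosts, approved)


def indirect_candidates(edges, viral_protein, human_proteins, approved):
    """Approved drugs targeting human proteins one interaction away from a direct host protein."""
    first_hop = neighbors(edges, viral_protein, "interacts_with") & human_proteins
    second_hop = set()
    for host in first_hop:
        second_hop |= neighbors(edges, host, "interacts_with") & human_proteins
    second_hop -= first_hop  # keep only proteins reached through another human protein
    return approved_drugs_targeting(edges, second_hop, approved)


# Toy graph with placeholder identifiers.
edges = [
    ("EX:viral1", "interacts_with", "EX:human1"),
    ("EX:human1", "interacts_with", "EX:human2"),
    ("EX:drugA", "targets", "EX:human1"),
    ("EX:drugB", "targets", "EX:human2"),
    ("EX:drugC", "targets", "EX:human1"),
]
human_proteins = {"EX:human1", "EX:human2"}
approved = {"EX:drugA", "EX:drugB"}  # EX:drugC is treated as not approved

direct = direct_candidates(edges, "EX:viral1", human_proteins, approved)
indirect = indirect_candidates(edges, "EX:viral1", human_proteins, approved)
```

In this toy graph the direct pattern reaches `EX:drugA` and the indirect pattern reaches `EX:drugB`, while the unapproved `EX:drugC` is filtered out of both.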
Figure 6.
Visualization of KG-COVID-19 knowledge graph node embeddings using t-SNE. Embeddings were created for each node in the KG-COVID-19 knowledge graph and t-SNE was performed as described in Experimental Procedures. Nodes categorized with one of the ten most numerous Biolink categories were then selected. Colors indicate the Biolink category for each node.

