Elife. 2020 Mar 17:9:e52614.

doi: 10.7554/eLife.52614.

Wikidata as a knowledge graph for the life sciences

Andra Waagmeester^#¹, Gregory Stupp^#², Sebastian Burgstaller-Muehlbacher³, Benjamin M Good², Malachi Griffith⁴, Obi L Griffith⁴, Kristina Hanspers⁵, Henning Hermjakob⁶, Toby S Hudson⁷, Kevin Hybiske⁸, Sarah M Keating⁶, Magnus Manske⁹, Michael Mayers², Daniel Mietchen¹⁰, Elvira Mitraka¹¹, Alexander R Pico⁵, Timothy Putman², Anders Riutta⁵, Nuria Queralt-Rosinach², Lynn M Schriml¹¹, Thomas Shafee¹², Denise Slenter¹³, Ralf Stephan¹⁴, Katherine Thornton¹⁵, Ginger Tsueng², Roger Tu², Sabah Ul-Hasan², Egon Willighagen¹³, Chunlei Wu², Andrew I Su²

Affiliations

¹ Micelio, Antwerpen, Belgium.
² Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, United States.
³ Center for Integrative Bioinformatics Vienna, Max Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria.
⁴ McDonnell Genome Institute, Washington University School of Medicine, St. Louis, United States.
⁵ Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, United States.
⁶ European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom.
⁷ School of Chemistry, The University of Sydney, Sydney, Australia.
⁸ Division of Allergy and Infectious Diseases, Department of Medicine, University of Washington, Seattle, United States.
⁹ Wellcome Trust Sanger Institute, Cambridge, United Kingdom.
¹⁰ School of Data Science, University of Virginia, Charlottesville, United States.
¹¹ University of Maryland School of Medicine, Baltimore, United States.
¹² Department of Animal Plant and Soil Sciences, La Trobe University, Melbourne, Australia.
¹³ Department of Bioinformatics-BiGCaT, NUTRIM, Maastricht University, Maastricht, Netherlands.
¹⁴ Retired researcher, Berlin, Germany.
¹⁵ Yale University Library, Yale University, New Haven, United States.

^# Contributed equally.

PMID: 32180547
PMCID: PMC7077981
DOI: 10.7554/eLife.52614

Andra Waagmeester et al. Elife. 2020.

Abstract

Wikidata is a community-maintained knowledge base that has been assembled from repositories in the fields of genomics, proteomics, genetic variants, pathways, chemical compounds, and diseases, and that adheres to the FAIR principles of findability, accessibility, interoperability and reusability. Here we describe the breadth and depth of the biomedical knowledge contained within Wikidata, and discuss the open-source tools we have built to add information to Wikidata and to synchronize it with source databases. We also demonstrate several use cases for Wikidata, including the crowdsourced curation of biomedical ontologies, phenotype-based diagnosis of disease, and drug repurposing.

Keywords: computational biology; data mining; drug repurposing; knowledge graphs; science forum; systems biology; Wikidata.

Conflict of interest statement

AW, GS, SB, BG, MG, OG, KH, HH, TH, KH, SK, MM, MM, DM, EM, AP, TP, AR, NQ, LS, TS, DS, RS, KT, GT, RT, SU, EW, CW, AS: No competing interests declared.

Figures

**Figure 1. A simplified class-level diagram of the Wikidata knowledge graph for biomedical entities.**
Each box represents one type of biomedical entity. The header displays the name of that entity type (e.g., pharmaceutical product) and the number of Wikidata items for that entity type. The lower portion of each box displays a partial listing of attributes about each entity type and the number of Wikidata items for each attribute. Edges between boxes represent the number of Wikidata statements corresponding to each combination of subject type, predicate, and object type. For example, there are 1505 statements with 'pharmaceutical product' as the subject type, 'therapeutic area' as the predicate, and 'disease' as the object type. For clarity, edges for reciprocal relationships (e.g., 'has part' and 'part of') are combined into a single edge, and scientific articles (which are widely cited in statement references) have been omitted. All counts of Wikidata items are current as of September 2019. The most common data sources cited as references are available in Figure 1—source data 1. Data are generated using the code in https://github.com/SuLab/genewikiworld (archived at Mayers et al., 2020). A more complete version of this graph diagram can be found at https://commons.wikimedia.org/wiki/File:Biomedical_Knowledge_Graph_in_Wikidata.svg.
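
The edge counts described in this caption amount to tallying statements by (subject type, predicate, object type). The sketch below illustrates that tally on a handful of made-up statements. It is not the code from the genewikiworld repository, and the item names, the `item_types` mapping, and the `count_edges` helper are hypothetical.

```python
from collections import Counter


def count_edges(statements, item_types):
    """Count statements per (subject type, predicate, object type) combination."""
    counts = Counter()
    for subject, predicate, obj in statements:
        subject_type = item_types.get(subject)
        object_type = item_types.get(obj)
        # Skip statements whose subject or object has no known entity type.
        if subject_type is None or object_type is None:
            continue
        counts[(subject_type, predicate, object_type)] += 1
    return counts


# Made-up items and statements, for illustration only (not real Wikidata content).
item_types = {
    "drug_A": "pharmaceutical product",
    "drug_B": "pharmaceutical product",
    "disease_X": "disease",
    "gene_G": "gene",
}
statements = [
    ("drug_A", "therapeutic area", "disease_X"),
    ("drug_B", "therapeutic area", "disease_X"),
    ("gene_G", "genetic association", "disease_X"),
    ("drug_A", "therapeutic area", "untyped_item"),  # ignored: object type unknown
]

edge_counts = count_edges(statements, item_types)
for (subject_type, predicate, object_type), n in edge_counts.items():
    print(f"{subject_type} --{predicate}--> {object_type}: {n}")
```

In the figure itself, each such count labels one edge between two boxes.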

**Figure 2. Generalizable SPARQL template for identifier translation.**
SPARQL is the primary query language for accessing Wikidata content. These simple SPARQL examples show how identifiers of any biological type can easily be translated using SPARQL queries. The top query demonstrates the translation of a small list of gene symbols (wdt:P353) to Entrez Gene IDs (wdt:P351), while the bottom example shows conversion of RxNorm concept IDs (wdt:P3345) to NDF-RT IDs (wdt:P2115). These queries can be submitted to the Wikidata Query Service (WDQS; https://query.wikidata.org/) to get real-time results. Translation to and from a wide variety of identifier types can be performed using slight modifications on these templates, and relatively simple extensions of these queries can filter mappings based on the statement references and/or qualifiers. A full list of Wikidata properties can be found at https://www.wikidata.org/wiki/Special:ListProperties. Note that for translating a large number of identifiers, it is often more efficient to perform a SPARQL query to retrieve all mappings and then perform additional filtering locally.
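
As a rough illustration of the template idea, the Python sketch below builds an identifier-translation query from a source property, a target property, and a list of identifiers. The property IDs are those named in the caption. The helper names, the example gene symbols and the `User-Agent` string are assumptions made for illustration, and the query text is a plausible reconstruction rather than the exact query shown in the figure.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_translation_query(source_prop, target_prop, identifiers):
    """Build a SPARQL query that maps identifiers of one type to another."""
    # Escape backslashes and double quotes so each identifier is a valid string literal.
    literals = " ".join(
        '"{}"'.format(i.replace("\\", "\\\\").replace('"', '\\"'))
        for i in identifiers
    )
    return (
        "SELECT ?source ?target WHERE {\n"
        f"  VALUES ?source {{ {literals} }}\n"
        f"  ?item wdt:{source_prop} ?source ;\n"
        f"        wdt:{target_prop} ?target .\n"
        "}"
    )


def run_query(query):
    """Send a query to WDQS and return (source, target) pairs. Needs network access."""
    url = WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    request = Request(url, headers={"User-Agent": "identifier-translation-sketch/0.1"})
    with urlopen(request) as response:
        data = json.load(response)
    return [
        (row["source"]["value"], row["target"]["value"])
        for row in data["results"]["bindings"]
    ]


# Gene symbols (P353) to Entrez Gene IDs (P351); the symbols are arbitrary examples.
gene_query = build_translation_query("P353", "P351", ["CDK2", "AKT1", "RORA"])
# RxNorm concept IDs (P3345) to NDF-RT IDs (P2115); the ID is a placeholder.
drug_query = build_translation_query("P3345", "P2115", ["0000"])
print(gene_query)
```

Calling `run_query(gene_query)` would submit the query to the live service. For large identifier lists, dropping the `VALUES` line and filtering the returned pairs locally follows the caption's advice about retrieving all mappings first.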

**Figure 3. A representative SPARQL query that integrates data from multiple data resources and annotation types.**
This example integrative query incorporates data on genetic associations to disease, Gene Ontology annotations for cellular compartment, protein target information for compounds, pathway data, and protein domain information. Specifically, this query (depicted schematically at right) retrieves genes that (i) are associated with a respiratory system disease, (ii) encode a membrane-bound protein, and (iii) sit within the same biochemical pathway as (iv) a second gene encoding a protein with a serine-threonine kinase domain and (v) a known inhibitor, and reports a list of those inhibitors. Aspects related to Disease Ontology are shown in blue, aspects related to biochemistry in red/orange, and aspects related to chemistry in green. Properties are shown in italics. Real-time query results can be viewed at https://w.wiki/6pZ.

**Figure 4. BOQA analysis of suspected cases of the disease Congenital Disorder of Deglycosylation (CDDG).**
We used an algorithm called BOQA to rank potential diagnoses based on clinical phenotypes. Here, clinical phenotypes from two patients with suspected CDDG were extracted from a published case report (Caglayan et al., 2015). These phenotypes were run through BOQA using phenotype-disease annotations from the Human Phenotype Ontology (HPO) alone, or from a combination of HPO and Wikidata. The analysis was repeated using several versions of the disease-phenotype annotations (shown along the x-axis). The probability score for CDDG is reported on the y-axis. These results demonstrate that the inclusion of Wikidata-based disease-phenotype annotations would have significantly improved the diagnosis predictions from BOQA at earlier time points, prior to their official inclusion in the HPO annotation file. Details of this analysis can be found at https://github.com/SuLab/Wikidata-phenomizer (archived at Tu et al., 2020).

**Figure 5. Drug repurposing using the Wikidata knowledge graph.**
We used Rephetio, a graph-based algorithm for predicting drug repurposing candidates (Himmelstein et al., 2017), to analyze three historical snapshots of the Wikidata knowledge graph. Performance was quantified as the area under the receiver operating characteristic curve (AUC). This analysis demonstrated that the performance of Rephetio in drug repurposing improved over time based only on improvements to the underlying knowledge graph. Details of this analysis can be found at https://github.com/SuLab/WD-rephetio-analysis (archived at Mayers and Su, 2020).
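
For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen true indication receives a higher score than a randomly chosen non-indication. The sketch below computes it by direct pairwise comparison. It is a generic illustration with made-up labels and scores, not the evaluation code from the WD-rephetio-analysis repository, and the `roc_auc` helper is a hypothetical name.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 marks a known indication, 0 a drug-disease pair that is not one.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 5 of the 6 positive-negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a rising AUC across snapshots indicates that known indications are being ranked higher.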

**Figure 5—figure supplement 1. Drug repurposing using the Wikidata knowledge graph, evaluated using an external test set.**
The analysis in Figure 5 was based on a cross-validation of indications that were present in Wikidata. This time-resolved analysis was run using an external gold standard set of indications from Drug Central (Ursu et al., 2017).

References

1. Agarwala R, Barrett T, Beck J, Benson DA, Bollin C, Bolton E, Bourexis D, Brister JR, Bryant SH, Canese K, Cavanaugh M, Charowhas C, Clark K, Dondoshansky I, Feolo M, Fitzpatrick L, Funk K, Geer LY, Gorelenkov V, Graeff A, Hlavina W, Holmes B, Johnson M, Kattman B, Khotomlianski V, Kimchi A, Kimelman M, Kimura M, Kitts P, Klimke W, Kotliarov A, Krasnov S, Kuznetsov A, Landrum MJ, Landsman D, Lathrop S, Lee JM, Leubsdorf C, Lu Z, Madden TL, Marchler-Bauer A, Malheiro A, Meric P, Karsch-Mizrachi I, Mnev A, Murphy T, Orris R, Ostell J, O'Sullivan C, Palanigobu V, Panchenko AR, Phan L, Pierov B, Pruitt KD, Rodarmer K, Sayers EW, Schneider V, Schoch CL, Schuler GD, Sherry ST, Siyan K, Soboleva A, Soussov V, Starchenko G, Tatusova TA, Thibaud-Nissen F, Todorov K, Trawick BW, Vakatov D, Ward M, Yaschenko E, Zasypkin A, Zbicz K, NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. 2018;46:D8–D13. doi: 10.1093/nar/gkx1095.
2. Amberger JS, Hamosh A. Searching Online Mendelian Inheritance in Man (OMIM): A knowledgebase of human genes and genetic phenotypes. Current Protocols in Bioinformatics. 2017;58:27. doi: 10.1002/cpbi.27.
3. Ayers P, Mietchen D, Orlowitz J, Proffitt M, Rodlund S, Seiver E, Taraborelli D, Vershbow B. WikiCite 2018-2019: Citations for the Sum of All Human Knowledge. Wikimedia Foundation; 2019.
4. Bastian F, Parmentier G, Roux J, Moretti S, Laudet V, Robinson-Rechavi M. Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species. In: Bairoch A, Cohen-Boulakia S, Froidevaux C, editors. Data Integration in the Life Sciences, Lecture Notes in Computer Science. Berlin Heidelberg: Springer; 2008. pp. 124–131.
5. Bauer S, Köhler S, Schulz MH, Robinson PN. Bayesian ontology querying for accurate and noise-tolerant semantic searches. Bioinformatics. 2012;28:2502–2508. doi: 10.1093/bioinformatics/bts471.
