Review
NPJ Digit Med. 2019 Sep 10;2:90.
doi: 10.1038/s41746-019-0162-5. eCollection 2019.

Enabling Web-scale data integration in biomedicine through Linked Open Data

Maulik R Kamdar et al. NPJ Digit Med. 2019.

Abstract

The biomedical data landscape is fragmented across many isolated, heterogeneous data and knowledge sources on the Web, which use varying formats, syntaxes, schemas, and entity notations. Biomedical researchers face severe logistical and technical challenges when querying, integrating, analyzing, and visualizing data from multiple diverse sources in the context of available biomedical knowledge. Semantic Web technologies and Linked Data principles may help achieve Web-scale semantic processing and data integration in biomedicine. The biomedical research community has been one of the earliest adopters of these technologies and principles, publishing data and knowledge on the Web as linked graphs and ontologies and thereby creating the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we provide our perspective on some opportunities offered by the use of LSLOD to integrate biomedical data and knowledge in three domains: (1) pharmacology, (2) cancer research, and (3) infectious diseases. We discuss some of the major challenges that hinder the widespread use and consumption of LSLOD by the biomedical research community. Finally, we provide a few technical solutions and insights that can address these challenges. Eventually, LSLOD can enable the development of scalable, intelligent infrastructures that support artificial intelligence methods for augmenting human intelligence to achieve better clinical outcomes for patients, to enhance the quality of biomedical research, and to improve our understanding of living systems.

Keywords: Computational platforms and environments; Data integration; Databases.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Diagrammatic representation of Linked Data and knowledge. RDF facilitates representation and data merging by extending the linking structure of the web. Entities in different sources (e.g., PDGFR protein in Source 1 and Source 3) are represented using a unique URI. Disparate sources can have independent facts (or triples) such as (Gleevec, has-target, PDGFR) and (PDGFR, is-implicated-in, Glioma), or other data (e.g., molecular weight of drugs, pathway information for proteins) that can be easily linked and integrated using RDF. A human user or a computational agent should, ideally, be able to navigate this Web of data to generate novel hypotheses (e.g., (Gleevec, possibly-associated-with, Glioma)) and discover relevant data and knowledge in other sources (e.g., cytotoxicity assay data in Source 2).
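The merging idea in this caption can be illustrated with a minimal sketch in plain Python. It assumes triples are stored as tuples of URI strings and that both sources use the same URI for PDGFR. The `http://example.org/` namespace and the helper names `uri` and `hypotheses` are hypothetical and are not taken from the paper.

```python
# Hypothetical namespace: the caption does not give concrete URIs.
EX = "http://example.org/"


def uri(name: str) -> str:
    """Build an illustrative URI for a named entity or relation."""
    return EX + name


# Each source publishes its own facts as (subject, predicate, object) triples.
source_1 = {(uri("Gleevec"), uri("has-target"), uri("PDGFR"))}
source_3 = {(uri("PDGFR"), uri("is-implicated-in"), uri("Glioma"))}

# Because PDGFR has the same URI in both sources, merging is a set union.
merged = source_1 | source_3


def hypotheses(triples):
    """Chain drug -> protein -> disease facts into candidate drug-disease links."""
    targets = [(s, o) for s, p, o in triples if p == uri("has-target")]
    implicated = [(s, o) for s, p, o in triples if p == uri("is-implicated-in")]
    return {
        (drug, uri("possibly-associated-with"), disease)
        for drug, protein in targets
        for other_protein, disease in implicated
        if protein == other_protein
    }


for triple in hypotheses(merged):
    print(triple)
```

Neither source alone contains a path from the drug to the disease; the candidate link only appears after the union, which is the point of sharing URIs across sources.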
Fig. 2
SPARQL Query Federation. a Two sources—KEGG, a knowledge base of biochemical pathways, and DrugBank, a database containing molecular characteristics of drugs, are available as RDF Graphs on the LSLOD cloud (The LSLOD cloud image is derived with permission under a CC-BY Attribution 4.0 International Licence from the LOD cloud diagram at lod-cloud.net after cropping modifications). Snippets of the KEGG and DrugBank RDF graphs are respectively shown, and similar Drug entities in these RDF graphs are mapped using the ‘x-ref’ link. b An intelligent query federation architecture can determine which SPARQL endpoint to query based on the content of the underlying RDF graphs (i.e., drug–protein interaction knowledge from KEGG, and half-life information from DrugBank). c The user-provided query is shown using a visual SPARQL representation, with variable nodes ?dr (drugs), ?pr (proteins), and ?hl (half-lives of drugs). This query is executed by the user against the query federation architecture. d The query federation architecture returns a result set to the user (e.g., Gleevec targets PDGFR, and has a half-life of “18 h”)
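A toy version of the federation flow in panels b to d might look like the sketch below. It is not the architecture described in the paper: the two endpoints are in-memory sets of triples, source selection is done only by checking which graph contains a given predicate, and the drug identifiers (`kegg:drug-1`, `drugbank:drug-1`), the `label` predicate, and the variables `?kd` and `?name` are hypothetical. Only `?dr`, `?pr`, `?hl`, the `x-ref` link, and the "18 h" value come from the caption.

```python
# In-memory stand-ins for two SPARQL endpoints (identifiers are hypothetical).
ENDPOINTS = {
    "KEGG": {
        ("kegg:drug-1", "target", "PDGFR"),
        ("kegg:drug-1", "x-ref", "drugbank:drug-1"),
    },
    "DrugBank": {
        ("drugbank:drug-1", "label", "Gleevec"),
        ("drugbank:drug-1", "half-life", "18 h"),
    },
}


def match(pattern, triple, binding):
    """Return the binding extended so that pattern matches triple, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def federated_query(patterns, endpoints):
    """Evaluate triple patterns in order, joining results across endpoints.

    Assumes every pattern has a constant predicate, which is used to decide
    which endpoints are worth querying.
    """
    bindings = [{}]
    for pattern in patterns:
        predicate = pattern[1]
        relevant = [
            graph for graph in endpoints.values()
            if any(triple[1] == predicate for triple in graph)
        ]
        extended_bindings = []
        for binding in bindings:
            for graph in relevant:
                for triple in graph:
                    extended = match(pattern, triple, binding)
                    if extended is not None:
                        extended_bindings.append(extended)
        bindings = extended_bindings
    return bindings


query = [
    ("?kd", "target", "?pr"),      # answered by KEGG
    ("?kd", "x-ref", "?dr"),       # KEGG link to the DrugBank entity
    ("?dr", "label", "?name"),     # answered by DrugBank
    ("?dr", "half-life", "?hl"),   # answered by DrugBank
]

for row in federated_query(query, ENDPOINTS):
    print(row["?name"], "targets", row["?pr"], "with half-life", row["?hl"])
```

The join across sources depends entirely on the `x-ref` triple: without it, `?dr` would never be bound to a DrugBank entity and the half-life pattern would return nothing.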
Fig. 3
Interacting with Big Linked Cancer Data through the GenomeSnip visualization perspective. The Linked TCGA project provides several different visualization perspectives for biomedical researchers to explore and visualize integrated content from the following LSLOD data sources: (i) MESH, (ii) HGNC, (iii) KEGG, (iv) PubMed, (v) UniProt, and (vi) Linked TCGA. The GenomeSnip perspective provides an aggregative circular visualization of the human genome, and allows the user to interactively explore different genomic regions at different scales—a chromosome, b ideogram, and c gene and other regulatory regions (e.g., enhancers). d Relations (protein–protein interactions, gene co-mentions, etc.), as well as communities of genes or genomic regulatory regions, as detected by a community-detection or clustering algorithm, can also be visualized. The GenomeSnip perspective is available online at http://onto-apps.stanford.edu/genomesnip
Fig. 4
Challenges in consuming LSLOD content for biomedical applications. a Different LSLOD sources may use different URI representations for the same entity (e.g., different ChEBI URIs http://bio2rdf.org/chebi:31690 and http://purl.obolibrary.org/obo/CHEBI/31690 for the entity Gleevec). Hence, link traversal or query federation methods are not able to integrate content from KEGG and ChEMBL RDF graphs even when they have ‘x-ref’ links to the ‘similar’ ChEBI entity. b Different RDF graphs may use different semantics (e.g., drug-target and target). Different graph patterns may be used to depict the same relation, while capturing additional details. c Through a systematic analysis of biomedical ontologies in the BioPortal repository, we determined that while a significant overlap of content exists across biomedical ontologies, most ontologies reuse less than 5% of their terms, with several ontologies using incorrect term URIs (Graph generated from data presented in Kamdar et al.). d Unique drug–protein target interactions may exist across different data and knowledge sources, since these sources are published with different methods and intentions (Figure used with permission under a CC-BY Attribution 4.0 International License from Kamdar et al.). e Real-world SPARQL query to retrieve drug–protein target interactions from four different LSLOD sources: DrugBank, KEGG, PharmGKB, and the Comparative Toxicogenomics Database. f Real-world SPARQL query to retrieve activity, target, and pathway information for ligands interacting with the Ebola virus polymerase protein.
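One simple way to work around the URI mismatch in panel a is to reduce each URI to a canonical identifier before comparing. The sketch below is an illustration under assumptions, not the solution proposed in the paper: the regular expression only covers ChEBI-style URIs whose numeric ID follows `:`, `_`, or `/`, and the function names `canonical_chebi` and `same_entity` are hypothetical. The two example URIs are copied from the caption as written.

```python
import re

# Matches "chebi" followed by ':', '_' or '/' and a numeric ID at the end of the URI.
CHEBI_PATTERN = re.compile(r"chebi[:_/](\d+)$", re.IGNORECASE)


def canonical_chebi(uri):
    """Return a canonical 'CHEBI:<id>' string, or None if the URI is not recognised."""
    found = CHEBI_PATTERN.search(uri)
    return f"CHEBI:{found.group(1)}" if found else None


def same_entity(uri_a, uri_b):
    """Treat two URIs as the same entity when their canonical ChEBI IDs agree."""
    id_a, id_b = canonical_chebi(uri_a), canonical_chebi(uri_b)
    return id_a is not None and id_a == id_b


bio2rdf_uri = "http://bio2rdf.org/chebi:31690"
obo_uri = "http://purl.obolibrary.org/obo/CHEBI/31690"

print(canonical_chebi(bio2rdf_uri))
print(canonical_chebi(obo_uri))
print(same_entity(bio2rdf_uri, obo_uri))
```

A string-level rule like this has to be written per identifier scheme, so it does not scale to the whole LSLOD cloud; it only shows why plain string equality on URIs fails to join the two graphs.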
