BioFed: federated query processing over life sciences linked open data
- PMID: 28298238
- PMCID: PMC5353896
- DOI: 10.1186/s13326-017-0118-0
Abstract
Background: Biomedical data, e.g. from knowledge bases and ontologies, is increasingly made available following linked open data principles, at best as RDF triple data. This is a necessary step towards unified access to biological data sets, but it still requires solutions that query multiple endpoints for their heterogeneous data in order to retrieve all the meaningful information. Suggested solutions are based on query federation approaches, which require the submission of SPARQL queries to endpoints. Due to the size and complexity of the available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. Last but not least, the reliability of data resources in terms of access and quality has to be monitored over time. Our solution (BioFed) federates data over 130 SPARQL endpoints in life sciences and tailors query submission according to the provenance information. BioFed has been evaluated against the state-of-the-art solution FedX and forms an important benchmark for the life science domain.
Methods: The efficient cataloguing approach of the federated query processing system 'BioFed', the triple-pattern-wise source selection and the semantic source normalisation form the core of our solution. BioFed gathers and integrates data from newly identified public endpoints for federated access. Basic provenance information is linked to the retrieved data. Last but not least, BioFed makes use of the latest SPARQL standard (i.e., 1.1) to leverage the full benefits for query federation. The evaluation is based on 10 simple and 10 complex queries, which address data in 10 major and very popular data sources (e.g., DrugBank, Sider).
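To make triple-pattern-wise source selection over a catalogue more concrete, the following is a minimal Python sketch. It is not BioFed's implementation: the catalogue layout, the endpoint URLs, the vocabulary IRIs and the function names are all hypothetical, and the rule applied (match bound predicates and bound rdf:type objects against a per-endpoint index) is only one plausible reading of the approach described above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical catalogue: for each endpoint, the predicates and classes it is
# known to contain. URLs and vocabulary IRIs are placeholders, not real ones.
CATALOGUE = {
    "http://example.org/drugbank/sparql": {
        "predicates": {"http://example.org/drugbank/vocab#target", RDF_TYPE},
        "classes": {"http://example.org/drugbank/vocab#Drug"},
    },
    "http://example.org/sider/sparql": {
        "predicates": {"http://example.org/sider/vocab#sideEffect", RDF_TYPE},
        "classes": {"http://example.org/sider/vocab#Drug"},
    },
}


def is_variable(term):
    """A term is treated as a SPARQL variable if it starts with '?'."""
    return term.startswith("?")


def select_sources(pattern, catalogue):
    """Return the sorted list of endpoints that may answer one triple pattern."""
    _subject, predicate, obj = pattern
    if is_variable(predicate):
        # Unbound predicate: the index cannot prune, so keep every endpoint.
        return sorted(catalogue)
    if predicate == RDF_TYPE and not is_variable(obj):
        # Bound class: use the type index.
        return sorted(ep for ep, idx in catalogue.items() if obj in idx["classes"])
    # Bound predicate: use the property index.
    return sorted(ep for ep, idx in catalogue.items() if predicate in idx["predicates"])


def select_sources_for_query(patterns, catalogue):
    """Apply source selection independently to every triple pattern."""
    return {pattern: select_sources(pattern, catalogue) for pattern in patterns}
```

In this sketch a pattern with an unbound predicate falls back to all endpoints, which is why an index over both types and properties matters: the more terms a pattern binds, the fewer endpoints need to be contacted.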
Results: BioFed provides a single point of access to a large number of SPARQL endpoints offering life science data. It facilitates efficient query generation for data access and provides basic provenance information in combination with the retrieved data. BioFed fully supports SPARQL 1.1 and exposes each endpoint's availability through the EndpointData graph. Our evaluation of BioFed against FedX is based on 20 heterogeneous federated SPARQL queries and shows competitive execution performance in comparison to FedX, which can be attributed to the provision of provenance information for the source selection.
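As an illustration of how query generation can build on SPARQL 1.1 federation, the sketch below turns a per-pattern endpoint assignment into a query that uses SERVICE clauses. It is a simplified example under stated assumptions, not BioFed's actual query rewriting: the function names are hypothetical, terms are assumed to be either variables (starting with '?') or full IRIs, and literals and prefixes are not handled.

```python
def fmt(term):
    """Render a term: variables stay as-is, anything else is treated as an IRI."""
    return term if term.startswith("?") else f"<{term}>"


def service_block(endpoint, pattern):
    """Wrap one triple pattern in a SPARQL 1.1 SERVICE clause for one endpoint."""
    s, p, o = (fmt(t) for t in pattern)
    return f"{{ SERVICE <{endpoint}> {{ {s} {p} {o} . }} }}"


def build_federated_query(variables, assignments):
    """Build a SELECT query from (pattern, endpoints) pairs.

    A pattern assigned to several endpoints becomes a UNION of SERVICE
    blocks; a pattern with no endpoint is rejected.
    """
    groups = []
    for pattern, endpoints in assignments:
        if not endpoints:
            raise ValueError(f"no endpoint selected for pattern {pattern}")
        union = " UNION ".join(service_block(ep, pattern) for ep in endpoints)
        groups.append("  " + union)
    return "SELECT " + " ".join(variables) + " WHERE {\n" + "\n".join(groups) + "\n}"
```

Calling `build_federated_query(["?d", "?e"], [(("?d", "http://example.org/p", "?e"), ["http://example.org/a/sparql"])])` yields a query whose body contains a single SERVICE block for that placeholder endpoint. Sending one pattern per SERVICE call is the simplest possible plan; a real engine would typically group patterns that share an endpoint to reduce the number of remote requests.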
Conclusion: Developing and testing federated query engines for life sciences data is still a challenging task. According to our findings, it is advantageous to optimise the source selection. The cataloguing of SPARQL endpoints, including type and property indexing, leads to efficient querying of data resources over the Web of Data. This could even be further improved through the use of ontologies, e.g., for abstract normalisation of query terms.
Keywords: Life sciences dataset; Linked open data; SPARQL query federation.