J Biomed Semantics. 2017 Sep 20;8(1):42.

doi: 10.1186/s13326-017-0151-z.

PIBAS FedSPARQL: a web-based platform for integration and exploration of bioinformatics datasets

Marija Djokic-Petrovic¹ ², Vladimir Cvjetkovic³, Jeremy Yang⁴ ⁵, Marko Zivanovic⁶, David J Wild⁴

Affiliations

¹ Virtual World Services GmbH, Asperner Heldenplatz 6, 1220, Wien, Austria. m.djokic@kg.ac.rs.
² Department of Mathematics and Informatics, Faculty of Science, University of Kragujevac, Radoja Domanovica 12, Kragujevac, 34000, Serbia. m.djokic@kg.ac.rs.
³ Department of Mathematics and Informatics, Faculty of Science, University of Kragujevac, Radoja Domanovica 12, Kragujevac, 34000, Serbia.
⁴ School of Informatics and Computing, Indiana University, 901 E 10th St, Bloomington, Indiana, 47408, USA.
⁵ Translational Informatics Division, School of Medicine, University of New Mexico, Albuquerque, NM, 87131, USA.
⁶ Department of Biology and Ecology, Faculty of Science, University of Kragujevac, Radoja Domanovica 12, Kragujevac, 34000, Serbia.

PMID: 28931422
PMCID: PMC5607505
DOI: 10.1186/s13326-017-0151-z

Marija Djokic-Petrovic et al. J Biomed Semantics. 2017.

Abstract

Background: There is a huge variety of data sources relevant to chemical, biological and pharmacological research, but these data sources are highly siloed and cannot be queried together in a straightforward way. Semantic technologies offer the ability to create links and mappings across datasets and manage them as a single, linked network so that searching can be carried out across datasets, independently of the source. We have developed an application called PIBAS FedSPARQL that uses semantic technologies to allow researchers to carry out such searching across a vast array of data sources.

Results: PIBAS FedSPARQL is a web-based query builder and result set visualizer for bioinformatics data. As an advanced feature, our system can detect similar data items identified by different Uniform Resource Identifiers (URIs), using a text-mining algorithm that processes named entities for use in a Vector Space Model with cosine similarity measures. To our knowledge, PIBAS FedSPARQL was unique among the systems we found in allowing the detection of similar data items. As a query builder, our system allows researchers to intuitively construct and run Federated SPARQL queries across multiple data sources, including global initiatives such as Bio2RDF, Chem2Bio2RDF and EMBL-EBI, one local initiative called CPCTAS, and additional user-specified data sources. From the input topic, subtopic, template and keyword, a corresponding initial Federated SPARQL query is created and executed. Based on the data obtained, end users can choose the most appropriate data sources in their area of interest and exploit their Resource Description Framework (RDF) structure, which allows users to select certain properties of the data to enhance query results.

Conclusions: The developed system is flexible and allows intuitive creation and execution of queries for an extensive range of bioinformatics topics. Also, the novel "similar data items detection" algorithm can be particularly useful for suggesting new data sources and cost optimization for new experiments. PIBAS FedSPARQL can be expanded with new topics, subtopics and templates on demand, rendering information retrieval more robust.

Keywords: Bioinformatics; Data integration; Data mining and information retrieval; Federated SPARQL query; Ontologies.

Conflict of interest statement

Authors’ information

MDJP is a Research Associate and PhD student of computer science at the Department of Mathematics and Informatics, Faculty of Science, University of Kragujevac, Serbia. She is currently employed as a software developer at an Austrian company that is supported by the Graz University of Technology. VC is an Assistant Professor at the Department of Mathematics and Informatics, Faculty of Science, University of Kragujevac, Serbia. JY is a research scientist at the School of Informatics and Computing, Indiana University, and the Translational Informatics Division, School of Medicine, University of New Mexico, focused on biomolecular and biomedical data science. MZ is a Research Associate at the Department of Biology and Ecology, Faculty of Science, University of Kragujevac, Serbia. DW is an Associate Professor at Indiana University School of Informatics and Computing, and leads the Integrative Data Science Laboratory. This group created one of the original semantic data sources used in this work (Chem2Bio2RDF).

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
PIBAS FedSPARQL architecture overview. The architecture consists of two main layers: query engine and user interface. The user interface enables users to construct simple and advanced queries and view the results of their execution. The query engine performs a series of demanding processes that need to be completed before queries can be executed. The main query engine component, *Data Source Manager*, scans the local *DataSources* ontology, reads the user’s input and passes the information through the *Query preparation* component to the *SPARQL query runner* component, where the queries are executed. The *Dataset projection* component plays a role in the “*Dynamic query filter*” feature, allowing users to easily discover the structure of underlying datasets included in Federated SPARQL queries. The *Detecting Similar Data Items* component identifies similar data items from results retrieved after running predefined queries or queries extended with new datasets

**Fig. 2**
Representations of basic relations in the *DataSources* ontology in the Protégé editor: a) *Topic Biology*; b) *Subtopic BiologyTarget*; c) Template “*Find targets for the drug*” and some of its properties; d) *PIBAS/CPCTAS* dataset instance. This figure shows screenshots of the local ontology *DataSources* in the Protégé ontology editor [50]. The ontology contains information about initiatives and datasets included in predefined Federated SPARQL queries. Each dataset in the ontology is represented as an instance of a certain class. The object property *conectedWith* connects dataset instances with template instances. Every *Subtopic class* instance is connected with a *Template class* instance through the object property *hasTemplate*. Every *Topic class* instance is connected with a *Subtopic class* instance through the object property *hasSubTopic*

**Fig. 3**
Predefined query of *Template2* for its pre-selected datasets. This figure shows the predefined Federated SPARQL query of the template “*Find targets for the drug*”. This query covers the *PIBAS/CPCTAS*, *Drugbank/Bio2RDF*, *ChEMBL/EMBL-EBI* and *BindingDB/Chem2Bio2RDF* datasets. All predefined Federated SPARQL queries in the local *DataSources* ontology contain “%s” placeholders, which represent object values that will be replaced with the keyword entered by the user. The last “%s” placeholder will be replaced with a particular pattern query if a new dataset is added using the “*Add new dataset*” feature
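The placeholder substitution described in the Fig. 3 caption can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names, the example endpoints under `example.org`, the vocabulary URIs and the escaping rule are all assumptions introduced here, and the real templates live in the *DataSources* ontology.

```python
def escape_literal(keyword: str) -> str:
    """Escape backslashes and double quotes so the keyword is safe inside "..."."""
    return keyword.replace("\\", "\\\\").replace('"', '\\"')


def fill_template(template: str, keyword: str, extra_pattern: str = "") -> str:
    """Fill a predefined query: every "%s" except the last becomes the keyword,
    and the last "%s" becomes the pattern query of a newly added dataset
    (or the empty string when no dataset was added)."""
    parts = template.split("%s")
    if len(parts) < 2:
        raise ValueError('template contains no "%s" placeholder')
    *head, tail = parts
    return escape_literal(keyword).join(head) + extra_pattern + tail


# Hypothetical template: endpoints and vocabulary are placeholders,
# not the ones stored in the DataSources ontology.
TEMPLATE = """SELECT DISTINCT ?target ?source WHERE {
  {
    SERVICE <http://example.org/a/sparql> {
      ?drug <http://example.org/vocab/label> "%s" ;
            <http://example.org/vocab/target> ?target .
      BIND("dataset-a" AS ?source)
    }
  }
  UNION
  {
    SERVICE <http://example.org/b/sparql> {
      ?drug <http://example.org/vocab/name> "%s" ;
            <http://example.org/vocab/hasTarget> ?target .
      BIND("dataset-b" AS ?source)
    }
  }
  %s
}"""

# Hypothetical pattern query for a user-added dataset.
EXTRA = """UNION
  {
    SERVICE <http://example.org/c/sparql> {
      ?target <http://example.org/vocab/boundBy> ?drug .
      BIND("dataset-c" AS ?source)
    }
  }"""

query = fill_template(TEMPLATE, "aspirin", EXTRA)
```

Calling `fill_template(TEMPLATE, "aspirin")` without a third argument leaves the final placeholder empty, which corresponds to running the predefined query with only its pre-selected datasets.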

**Fig. 4**
Running of predefined query in PIBAS FedSPARQL a) Initial user interface b) Results after running predefined query. The initial user interface allows users to create queries in a very simple way by selecting a (sub)topic, template and entering a keyword. By clicking on the “*Run query*” button, the predefined Federated SPARQL query is executed and users receive results in the form of a table. The first column shows the retrieved results as URI or string. The second column displays the data source and initiative name. The icon in the top-right corner of the table shows statistical information about the retrieved data, including data source name, initiative name and the number of obtained data items per data source

**Fig. 5**
Adding new dataset to predefined query. This figure shows the pop-up window that allows users to incorporate any new dataset not included in the predefined list of datasets for an existing template. Users need to enter the dataset name, initiative name, dataset link, a comment, the endpoint URL, pattern query and the dataset properties most relevant for the selected template and topic. Finally, they need to click the “*Add*” button to complete the action. Conversance with SPARQL and the underlying ontology is necessary for this step

**Fig. 6**
Rewritten predefined query after adding new dataset. This figure shows the rewritten predefined Federated SPARQL query of the template “*Find targets for the drug*” after incorporating a new test dataset

**Fig. 7**
Result set after adding new dataset to predefined query. This figure shows the results in a table after executing the rewritten predefined Federated SPARQL query. The results table has the same layout as in Fig. 4

**Fig. 8**
Accordion elements for dynamic query filtering a) List of predicates for *PIBAS/CPCTAS* dataset b) List of predicates for *BindingDB/Chem2Bio2RDF* dataset. This figure shows the dynamic accordion elements for the *PIBAS/CPCTAS* and *BindingDB/Chem2Bio2RDF* datasets. The accordion elements contain a list of dataset properties which are dynamically created according to the template “*Find targets for the drug*”. Each property listed in an accordion element is hyperlinked to a web page with its description. The same applies to all datasets used in Federated SPARQL query. Users can select their desired properties and add them to the query by clicking on the “*Add to query*” button

**Fig. 9**
Generated star-shaped query for BindingDB/Chem2Bio2RDF dataset after dynamic query filtering. This figure shows the star-shaped SPARQL query created for the BindingDB/Chem2Bio2RDF dataset after adding the properties http://chem2bio2rdf.org/bindingdb/resource/CID_GENE and http://chem2bio2rdf.org/bindingdb/resource/uniprot to the query
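A star-shaped query is one in which every triple pattern shares the same subject variable. The sketch below shows one way such a query could be assembled from a list of selected properties. The function name, the variable-naming scheme and the endpoint URL are assumptions made for illustration; only the two property URIs come from the Fig. 9 caption.

```python
import re
from typing import List


def star_query(endpoint: str, predicates: List[str], subject: str = "item") -> str:
    """Build a star-shaped query: every triple pattern shares one subject variable."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    variables = [f"?{subject}"]
    patterns = []
    for index, predicate in enumerate(predicates):
        # Derive a readable variable name from the last URI segment;
        # the index suffix keeps names unique if two segments coincide.
        local_name = re.split(r"[/#]", predicate)[-1]
        variable = "?" + re.sub(r"\W", "_", local_name) + f"_{index}"
        variables.append(variable)
        patterns.append(f"    ?{subject} <{predicate}> {variable} .")
    return (
        f"SELECT {' '.join(variables)} WHERE {{\n"
        f"  SERVICE <{endpoint}> {{\n"
        + "\n".join(patterns)
        + "\n  }\n}"
    )


selected = [
    "http://chem2bio2rdf.org/bindingdb/resource/CID_GENE",
    "http://chem2bio2rdf.org/bindingdb/resource/uniprot",
]
# The endpoint below is a placeholder, not the real Chem2Bio2RDF endpoint.
print(star_query("http://example.org/bindingdb/sparql", selected))
```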

**Fig. 10**
A sample result table after dynamic query filtering. This figure shows the results of dynamic query filtering. The results are organized by source (*PIBAS/CPCTAS* and *BindingDB/Chem2Bio2RDF*) and displayed in a paginated table. They can be sorted and filtered in order to refine the query result and show only the most relevant information

**Fig. 11**
Process of string transformation. The process of string transformation implies conversion and filtering of a string. Initially, the string is converted to lower case. Then it passes through regular expression filtering to extract alphabetic and numeric characters [a-z, 0–9]. The string is then purified by eliminating words that are in the list of stop words. This list contains high-frequency words with relatively low information content (function words and pronouns). Finally, suffix removal is performed by applying Porter's Stemming Algorithm [52]
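The four steps in the Fig. 11 caption can be sketched as a small pipeline. Note the assumptions: the stop-word list is a short illustrative sample, and `crude_stem` is a deliberately simple stand-in that is not Porter's algorithm. A real Porter stemmer, for example `nltk.stem.PorterStemmer().stem` if NLTK is installed, can be passed as the `stem` argument instead.

```python
import re
from typing import Callable, FrozenSet, List

# Illustrative sample only; the stop-word list used by the platform is not given here.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "and", "the", "of", "in", "to", "for", "is", "it", "this", "that", "with"}
)


def crude_stem(word: str) -> str:
    """Stand-in suffix stripper. NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform(
    text: str,
    stop_words: FrozenSet[str] = STOP_WORDS,
    stem: Callable[[str], str] = crude_stem,
) -> List[str]:
    """Lower-case, keep [a-z0-9] runs, drop stop words, then stem."""
    lowered = text.lower()                          # step 1: lower case
    tokens = re.findall(r"[a-z0-9]+", lowered)      # step 2: regex filtering
    kept = [t for t in tokens if t not in stop_words]  # step 3: stop words
    return [stem(t) for t in kept]                  # step 4: suffix removal


tokens = transform("The Binding of Proteins!")
```

Passing the stemmer as a parameter keeps the sketch dependency-free while leaving room for the algorithm the caption actually names.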

**Fig. 12**
Similar data items (URIs) obtained by our algorithm after adding a new dataset. This figure shows similar targets detected in the results retrieved after adding a new dataset to the “Find targets for the drug” template and running the rewritten predefined Federated SPARQL query. The results are shown in the form of a table on a new web page
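Detection of similar data items with a vector space model and cosine similarity can be illustrated roughly as below. This is a sketch under stated assumptions: it uses raw term-frequency vectors, a hypothetical threshold of 0.8 and made-up `example.org` URIs, none of which are specified in the text above, and it assumes each item's label has already been tokenized.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(a_tokens: List[str], b_tokens: List[str]) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_pairs(
    items: Dict[str, List[str]], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return URI pairs whose token vectors reach the (assumed) threshold."""
    pairs = []
    for left, right in combinations(sorted(items), 2):
        score = cosine_similarity(items[left], items[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs


# Hypothetical URIs with already-transformed labels.
items = {
    "http://example.org/a/1": ["cyclooxygenas", "2"],
    "http://example.org/b/7": ["cyclooxygenas", "2", "human"],
    "http://example.org/c/3": ["thrombin"],
}
matches = similar_pairs(items)
```

Comparing all pairs is quadratic in the number of items, which is acceptable for a single result table but would need indexing for large result sets.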

**Fig. 13**
Matching results using methods Predicates selected and Predicates not selected. This figure shows a graphical representation of data from Table 6. The graphic contains the number of similar data items obtained using two approaches, Predicates selected and Predicates not selected, and similarity matching result based on human judgment (0 means that no matching exists, 1 means that a matching exists)

**Fig. 14**
Results of usability evaluation obtained from our questionnaire. This figure shows the final outcome of the survey carried out in cooperation with RC staff. For this survey, the six-item Likert scale-based System Usability Scale (SUS) questionnaire was used. In order to numerically analyze the survey results, the Likert scale responses were translated to numbers using the following five-point scale: 1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree. Based on the questionnaire outcome, average values (AVG) and standard deviation values (STD) were calculated and graphically presented
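The translation of Likert responses to numbers and the AVG/STD calculation can be sketched as below. The mapping follows the five-point scale in the caption; the use of the sample standard deviation is an assumption (the caption does not say which variant was used), and the example responses are made up for illustration, not survey data.

```python
from statistics import mean, stdev
from typing import List, Tuple

# Five-point mapping as stated in the caption.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def summarize(responses: List[str]) -> Tuple[float, float]:
    """Return (AVG, STD) for one questionnaire item.

    Uses the sample standard deviation, which is an assumption;
    at least two responses are required.
    """
    scores = [LIKERT[r.strip().lower()] for r in responses]
    return mean(scores), stdev(scores)


# Made-up responses for a single item, for illustration only.
avg, std = summarize(["agree", "Agree", "neutral", "strongly agree"])
```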

Cited by

Enabling semantic queries across federated bioinformatics databases.
Sima AC, Mendes de Farias T, Zbinden E, Anisimova M, Gil M, Stockinger H, Stockinger K, Robinson-Rechavi M, Dessimoz C. Database (Oxford). 2019 Jan 1;2019:baz106. doi: 10.1093/database/baz106. PMID: 31697362.
Semantic Data Visualisation for Biomedical Database Catalogues.
Pereira A, Almeida JR, Lopes RP, Oliveira JL. Healthcare (Basel). 2022 Nov 15;10(11):2287. doi: 10.3390/healthcare10112287. PMID: 36421611.

References

1. Masseroli M, Mons B, Bongcam-Rudloff E, Ceri S, Kel A, Rechenmann F, Lisacek F, Romano P. Integrated bio-search: challenges and trends for the integration, search and comprehensive processing of biological information. BMC Bioinformatics. 2014;15(Suppl 1):S2. doi: 10.1186/1471-2105-15-S1-S2.
2. Stephens S, LaVigna D, DiLascio M, Luciano J. Aggregation of bioinformatics data using Semantic Web technology. Web Semantics: Science, Services and Agents on the World Wide Web. 2006;4(3):216–221.
3. Stevens R, Bodenreider O, Lussier YA. Semantic webs for life sciences. In: Pacific Symposium on Biocomputing. 2006. p. 112.
4. CPCTAS-LCMB, Faculty of Science, University of Kragujevac, Serbia. http://cpctas-lcmb.pmf.kg.ac.rs/lcmb/
5. Cvjetkovic V, Djokic M, Arsic B, Curcic M. The ontology supported intelligent system for experiment search in the scientific research center. Kragujevac Journal of Science. 2014;36:95–110. doi: 10.5937/KgJSci1436095C.
