A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications

Jerven Bolleman¹, Vincent Emonet¹, Adrian Altenhoff¹, Amos Bairoch¹, Marie-Claude Blatter¹, Alan Bridge¹, Séverine Duvaud¹, Elisabeth Gasteiger¹, Dmitry Kuznetsov¹, Sébastien Moretti¹, Pierre-Andre Michel¹, Anne Morgat¹, Marco Pagni¹, Nicole Redaschi¹, Monique Zahn-Zabal¹, Tarcisio Mendes de Farias¹, Ana Claudia Sima¹

PMID: 40378136
PMCID: PMC12083453
DOI: 10.1093/gigascience/giaf045

Jerven Bolleman et al. Gigascience. 2025 Jan 6;14:giaf045. doi: 10.1093/gigascience/giaf045.

Affiliation

¹ SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.

Abstract

Background: In recent decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning (for example, machine-learning algorithms for translating natural language questions to SPARQL), if a sufficiently large number of examples are provided and published in a common, machine-readable, and standardized format across different resources.

Findings: We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs), collected over several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1,000 example questions and queries, including almost 100 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.
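To make the idea of a question-query pair with minimal metadata concrete, the following is a minimal Python sketch, not the collection's actual representation. The class name `QuestionQueryPair`, its field names, and all `example.org` IRIs are hypothetical. The only SPARQL fact it relies on is that SPARQL 1.1 federated queries delegate part of a pattern to another endpoint with the `SERVICE` keyword, which the sketch detects with a simple regular expression.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QuestionQueryPair:
    """One example: a natural language question plus its SPARQL query.

    Field names are illustrative assumptions, not the collection's schema.
    """
    question: str  # human-written natural language question
    query: str     # corresponding SPARQL query text
    endpoint: str  # SPARQL endpoint the query is meant to run against


# Heuristic: matches `SERVICE <iri>` and `SERVICE SILENT <iri>`.
# It does not handle `SERVICE ?var`, and it would also match the keyword
# inside comments or string literals, so it is not a real SPARQL parser.
_SERVICE = re.compile(r"\bSERVICE\b\s+(?:SILENT\s+)?<([^>]+)>", re.IGNORECASE)


def remote_endpoints(pair: QuestionQueryPair) -> List[str]:
    """Return the endpoint IRIs named in SERVICE clauses, in order."""
    return _SERVICE.findall(pair.query)


def is_federated(pair: QuestionQueryPair) -> bool:
    """A query counts as federated here if it names at least one SERVICE IRI."""
    return bool(remote_endpoints(pair))


# Hypothetical example; the IRIs are placeholders, not real vocabularies.
example = QuestionQueryPair(
    question="Which proteins are linked to a disease, with labels from a second graph?",
    query="""
    SELECT ?protein ?label WHERE {
      ?protein <https://example.org/linkedTo> ?disease .
      SERVICE <https://example.org/other/sparql> {
        ?disease <https://example.org/label> ?label .
      }
    }
    """,
    endpoint="https://example.org/sparql",
)

print(is_federated(example), remote_endpoints(example))
```

Keeping the record to three fields mirrors the "minimal metadata" goal: everything else, such as whether a query is federated, can be derived from the query text itself.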

Conclusions: We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services. URL: https://github.com/sib-swiss/sparql-examples.

Keywords: Resource Description Framework (RDF); federated SPARQL; knowledge graphs; metadata; query editor.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1:**
Average number of triple patterns per query in each of the KGs contributing to the examples collection.

**Figure 2:**
End-to-end workflow, from example contributions to services that uniformly consume examples across distinct SPARQL endpoints (e.g., sparql-editor and Bio-Query template search).

**Figure 3:**
A graphical representation of the query [36] that retrieves disease-related proteins known to be located within the cell. This visualization is available in the GitHub pages of the collection of SPARQL examples [37]. In the figure, green cells show the projected variables, i.e., the variables that are selected to be included in the final result. The small circles act as intermediary, undefined variables (i.e., intermediary blank nodes) introduced by decomposing property paths into individual triple patterns. The gray boxes shown as edge labels are the properties in the query triple patterns. The numbered boxes shown on the left side of edges help to match the visualization to the corresponding triple patterns in the query shown on the right side.

**Figure 4:**
SPARQL query editor with context-aware autocomplete for the UniProt SPARQL endpoint. The list of query examples, classes, and properties is automatically retrieved from the endpoint.

**Figure 5:**
Example UniProt entry in the Bio-Query interface, integrated from the collection of SIB Example SPARQL queries. The collection is processed from the central GitHub repository in order to produce the JSON representation required by the interface. The standardization of the examples minimizes the effort needed to integrate them into the common interface, which can then act as a central hub to search for, and adapt, existing examples across SIB KGs.
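The Figure 3 caption mentions intermediary variables introduced by decomposing property paths into individual triple patterns. A minimal Python sketch of that idea follows; it is not the visualization tool's actual code. It handles only the `/` sequence operator and assumes every step is a prefixed name, so splitting on `/` is safe. The function name, the `?_stepN` variable naming, and the `ex:` prefix are hypothetical.

```python
from itertools import count
from typing import List, Tuple

Triple = Tuple[str, str, str]


def decompose_sequence_path(subject: str, path: str, obj: str) -> List[Triple]:
    """Rewrite `subject p1/p2/.../pn obj` as n chained triple patterns.

    Assumes a pure sequence path of prefixed names (e.g. "ex:a/ex:b").
    Full IRIs in angle brackets and other path operators are not supported.
    """
    steps = [step.strip() for step in path.split("/")]
    if any(not step for step in steps):
        raise ValueError(f"empty step in property path: {path!r}")

    fresh = count(1)  # numbering for the intermediary variables
    triples: List[Triple] = []
    current = subject
    for index, step in enumerate(steps):
        is_last = index == len(steps) - 1
        # The last step ends at the original object; every earlier step
        # ends at a fresh intermediary variable.
        target = obj if is_last else f"?_step{next(fresh)}"
        triples.append((current, step, target))
        current = target
    return triples


for triple in decompose_sequence_path("?protein", "ex:annotation/ex:disease", "?disease"):
    print(" ".join(triple), ".")
```

A path with n steps yields n triple patterns and n - 1 intermediary variables, which correspond to the small circles described in the caption.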

References

1. UniProt Consortium. UniProt: the Universal Protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–31. doi: 10.1093/nar/gkac1052.
1. Sima A-C, de Farias TM. On the potential of artificial intelligence chatbots for data exploration of federated bioinformatics knowledge graphs. arXiv [cs.AI]. 2023. doi: 10.48550/arXiv.2304.10427.
1. Vollmers D, Srivastava N, Zahera HN, et al. UniQ-gen: unified query generation across multiple knowledge graphs. Knowledge Engineering and Knowledge Management. EKAW 2024. Lecture Notes in Computer Science, 2025. doi: 10.1007/978-3-031-77792-9_11.
1. Expasy overview of SIB Knowledge Graphs. [cited 20 Feb 2025]. https://www.expasy.org/search/SPARQL. Accessed 8 May 2025.
1. Vrandečić D, Krötzsch M. Wikidata: a free collaborative knowledgebase. Commun ACM. 2014;57:78–85. doi: 10.1145/2629489.
