Gigascience. 2025 Jan 6:14:giaf045.
doi: 10.1093/gigascience/giaf045.

A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications

Jerven Bolleman et al. Gigascience. 2025.

Abstract

Background: In recent decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning (for example, machine-learning algorithms for translating natural language questions to SPARQL), if a sufficiently large number of examples are provided and published in a common, machine-readable, and standardized format across different resources.
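As an illustration of what "queries across physically distributed knowledge graphs" look like in practice, the following is a minimal sketch of a federated SPARQL query, which uses the standard `SERVICE` keyword to delegate part of the pattern to a second endpoint. The endpoint URLs, the `ex:` vocabulary, and the helper names `service_endpoints` and `is_federated` are hypothetical and are not taken from the paper. The regular expression is a rough heuristic: it does not parse SPARQL, ignores `SERVICE ?variable` forms, and may match text inside comments or string literals.

```python
import re

# Hypothetical federated query: the endpoints and the ex: vocabulary are
# placeholders for illustration, not examples from the collection.
FEDERATED_QUERY = """
PREFIX ex: <https://example.org/vocab/>
SELECT ?protein ?diseaseLabel
WHERE {
  ?protein ex:associatedWith ?disease .
  SERVICE <https://example.org/kg-b/sparql> {
    ?disease ex:label ?diseaseLabel .
  }
}
"""

# Matches "SERVICE <iri>" and "SERVICE SILENT <iri>" (case-insensitive).
_SERVICE_PATTERN = re.compile(r"\bSERVICE\s+(?:SILENT\s+)?<([^>]+)>", re.IGNORECASE)


def service_endpoints(query: str) -> list[str]:
    """Return the remote endpoint IRIs named in SERVICE clauses (heuristic)."""
    return _SERVICE_PATTERN.findall(query)


def is_federated(query: str) -> bool:
    """Treat a query as federated if it names at least one SERVICE endpoint."""
    return bool(service_endpoints(query))


if __name__ == "__main__":
    print(service_endpoints(FEDERATED_QUERY))
    print(is_federated("SELECT * WHERE { ?s ?p ?o }"))
```

A heuristic like this could be used to count federated examples in a set of queries, although a real SPARQL parser would be more reliable.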

Findings: We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs), collected over several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1,000 example questions and queries, including almost 100 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.

Conclusions: We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services. URL: https://github.com/sib-swiss/sparql-examples.

Keywords: Resource Description Framework (RDF); federated SPARQL; knowledge graphs; metadata; query editor.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1: Average number of triple patterns per query in each of the KGs contributing to the examples collection.
Figure 2: End-to-end workflow, from example contributions to services that uniformly consume examples across distinct SPARQL endpoints (e.g., sparql-editor and Bio-Query template search).
Figure 3: A graphical representation of the query [36] that retrieves disease-related proteins known to be located within the cell. This visualization is available in the GitHub pages of the collection of SPARQL examples [37]. Green cells show the projected variables, i.e., the variables selected for inclusion in the final result. The small circles are intermediary, undefined variables (i.e., intermediary blank nodes) introduced by decomposing property paths into individual triple patterns. The gray boxes shown as edge labels are the properties in the query triple patterns. The numbered boxes on the left side of the edges help to match the visualization to the corresponding triple patterns in the query shown on the right side.
Figure 4: SPARQL query editor with context-aware autocomplete for the UniProt SPARQL endpoint. The list of query examples, classes, and properties is automatically retrieved from the endpoint.
Figure 5: Example UniProt entry in the Bio-Query interface, integrated from the collection of SIB Example SPARQL queries. The collection is processed from the central GitHub repository to produce the JSON representation required by the interface. Standardizing the examples minimizes the effort needed to integrate them into the common interface, which can then act as a central hub to search for, and adapt, existing examples across SIB KGs.

References

    1. UniProt Consortium. UniProt: the Universal Protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–31. 10.1093/nar/gkac1052.
    2. Sima A-C, de Farias TM. On the potential of artificial intelligence chatbots for data exploration of federated bioinformatics knowledge graphs. arXiv [cs.AI]. 2023; 10.48550/arXiv.2304.10427.
    3. Vollmers D, Srivastava N, Zahera HN, et al. UniQ-gen: unified query generation across multiple knowledge graphs. Knowledge Engineering and Knowledge Management. EKAW 2024. Lecture Notes in Computer Science, 2025. 10.1007/978-3-031-77792-9_11.
    4. Expasy overview of SIB Knowledge Graphs. [cited 20 Feb 2025]. https://www.expasy.org/search/SPARQL. Accessed 8 May 2025.
    5. Vrandečić D, Krötzsch M. Wikidata: a free collaborative knowledgebase. Commun ACM. 2014;57:78–85. 10.1145/2629489.
