Bio-SODA UX: enabling natural language question answering over knowledge graphs with user disambiguation

Ana Claudia Sima¹, Tarcisio Mendes de Farias¹ ² ³, Maria Anisimova¹ ⁴, Christophe Dessimoz¹ ² ⁵ ⁶, Marc Robinson-Rechavi¹ ³, Erich Zbinden¹ ⁴, Kurt Stockinger⁴

Affiliations

¹ SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
² Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
³ Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland.
⁴ ZHAW Zurich University of Applied Sciences, Zurich, Switzerland.
⁵ Department of Genetics, Evolution, and Environment, University College London, London, UK.
⁶ Department of Computer Science, University College London, London, UK.

PMID: 36097541
PMCID: PMC9458692
DOI: 10.1007/s10619-022-07414-w

Ana Claudia Sima et al. Distrib Parallel Databases. 2022;40(2-3):409-440. doi: 10.1007/s10619-022-07414-w. Epub 2022 Jul 16.

Abstract

Natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web communities, with significant efforts devoted to question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by at least 20% in F1-score, and by an even larger margin on more complex bioinformatics datasets. Finally, we introduce Bio-SODA UX, a graphical user interface designed to assist users in the exploration of large knowledge graphs and in dynamically disambiguating natural language questions that target the data available in these graphs.
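To make the idea of using node centrality as a relevance signal more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a keyword such as "asthma" matches two nodes in a toy graph, and that candidates are ordered by a plain PageRank score alone. The node names, the edge list, and the helper functions `pagerank` and `rank_candidates` are all hypothetical; the abstract only states that centrality is included in the ranking, so the real system may well combine it with other signals.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list (illustrative only)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def rank_candidates(candidates, scores):
    """Order the candidate matches for one keyword by descending centrality."""
    return sorted(candidates, key=lambda node: scores[node], reverse=True)


# Hypothetical toy graph: three drugs link to the disease node,
# only one of them links to the side-effect node with the same label.
EDGES = [
    ("drug:A", "disease:asthma"),
    ("drug:B", "disease:asthma"),
    ("drug:C", "disease:asthma"),
    ("drug:C", "side_effect:asthma"),
]

scores = pagerank(EDGES)
ranked = rank_candidates(["side_effect:asthma", "disease:asthma"], scores)
print(ranked)
```

In this toy setting the more heavily referenced disease node is ranked first, which is the kind of default a user could then override through interactive disambiguation.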

Keywords: Knowledge graphs; Question answering; Ranking.

Figures

**Fig. 1**
Illustrative data model, simplified from the QALD4 benchmark datasets [17]. Consider the following question: *“Which drugs are used for asthma?”*. In the QALD4 dataset, “asthma” appears both as a disease instance (shown in green) and as a side effect (shown in red). The second interpretation describes drugs that can *trigger* asthma symptoms, which is the opposite of the user’s intended question. However, the predicate *used for* in the question cannot be easily linked to either of the properties indicated through arrows in the image. Because of this ambiguity, the question is difficult to translate correctly without external knowledge or training data (that is, without inferring that *used for* implies a drug targeting a disease) (Color figure online)

**Fig. 2**
Simplified data model based on the Bgee database and QALD4 [17] datasets. The data model is a multigraph, including disjoint properties – such as *isAbsentIn* and *isExpressedIn*, as well as inverse properties, such as *possibleDiseaseTarget* and *possibleDrug*. To make matters more complicated, a *Side Effect* and a *Disease* can be described by the same terms, with instances of the two classes being related via the *sameAs* property. As a result, even simple questions such as *“which drugs might lead to strokes?”* are hard to automatically translate correctly in the absence of external knowledge (*i.e.* “lead to” = “side effect”)

**Fig. 3**
Simplified answer pipeline for the query “*What are the drugs for diseases associated with the BRCA genes?*”. For the sake of simplicity, PageRank scores are solely displayed when more than one match is found

**Fig. 5**
Bio-SODA UX interface for knowledge graph exploration and query disambiguation. The three main components of the interface are: 1) an input field which also provides drop-downs with example candidate matches for each searched concept; 2) the fraction of the data model relevant to the question, shown in graph form; clicking on any node will display additional information in the “Details” box on the right; 3) the results table with options to extend with more attributes related to the concepts in the question

**Fig. 6**
Bio-SODA UX example use case for the question “drosophila anatomic entities at the embryo developmental stage”

**Fig. 7**
Bio-SODA UX example use case for the question “genes with lung in the description”

**Fig. 8**
Bio-SODA failure analysis. Out of the total 50 questions in the QALD4 biomedical benchmark, Bio-SODA cannot correctly answer 20. A further 12 out of 30 cannot be answered in the bioinformatics dataset, mainly due to query complexity (some queries have more than 10 triple patterns). Finally, on the CORDIS dataset, 10 out of 30 queries cannot be answered, a large fraction of which include features currently unsupported in Bio-SODA: aggregations, comparatives, conjunctions, etc.

References

1. Diefenbach D, Both A, Singh K, Maret P. Towards a question answering system over the semantic web. Semantic Web Preprint. 2018;2018:1–19.
2. Zheng W, Yu JX, Zou L, Cheng H. Question answering over knowledge graphs: question understanding via template decomposition. Proc. VLDB Endowm. 2018;11:1373–1386.
3. Vakulenko S, Garcia JDF, Polleres A, de Rijke M, Cochez M. Message passing for complex question answering over knowledge graphs. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019. pp. 1431–1440.
4. Li F, Jagadish HV. Constructing an interactive natural language interface for relational databases. Proc. VLDB Endowm. 2014;8:73–84. doi: 10.14778/2735461.2735468.
5. Li F, Jagadish HV. Understanding natural language queries over relational databases. ACM SIGMOD Rec. 2016;45:6–13. doi: 10.1145/2949741.2949744.
