Distrib Parallel Databases. 2022;40(2-3):409-440.
doi: 10.1007/s10619-022-07414-w. Epub 2022 Jul 16.

Bio-SODA UX: enabling natural language question answering over knowledge graphs with user disambiguation

Ana Claudia Sima et al. Distrib Parallel Databases. 2022.

Abstract

Natural language processing over structured data has become a growing research field within both the relational database and the Semantic Web communities, with significant effort devoted to question answering over knowledge graphs (KGQA). However, many of these approaches either specifically target open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets. Finally, we introduce Bio-SODA UX, a graphical user interface designed to assist users in exploring large knowledge graphs and in dynamically disambiguating natural language questions that target the data available in these graphs.

Keywords: Knowledge graphs; Question answering; Ranking.
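
The abstract states that the ranking algorithm includes node centrality as a measure of relevance, and the Fig. 3 caption mentions PageRank scores. The following is a minimal sketch of that idea only, not the Bio-SODA implementation: it computes PageRank by power iteration over a toy graph and orders candidate matches by score. The node names, the edge list, the damping factor, and the `rank_candidates` helper are all assumptions made for illustration, and ranking by PageRank alone is a simplification.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def rank_candidates(candidates, scores):
    """Hypothetical helper: order candidate matches by descending centrality."""
    return sorted(candidates, key=lambda node: scores[node], reverse=True)


# Toy graph (invented for illustration): "asthma" matches both a disease
# instance and a side-effect instance, as in the ambiguity described in Fig. 1.
toy_edges = [
    ("drug:A", "disease:asthma"),
    ("drug:B", "disease:asthma"),
    ("drug:C", "disease:asthma"),
    ("drug:C", "sideEffect:asthma"),
]

scores = pagerank(toy_edges)
ranked = rank_candidates(["sideEffect:asthma", "disease:asthma"], scores)
print(ranked)
```

In this toy graph the disease node receives more incoming links than the side-effect node, so it is ranked first. A real system would combine centrality with other signals, such as how well the matched label fits the question.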

Figures

Fig. 1
Illustrative data model, simplified from the QALD4 benchmark datasets [17]. Consider the following question: “Which drugs are used for asthma?”. In the QALD4 dataset, “asthma” appears both as a disease instance (shown in green) and as a side effect (shown in red). The second interpretation describes drugs that can trigger asthma symptoms and is therefore the opposite of the user’s intended question. However, the predicate “used for” in the question cannot be easily linked to either of the properties indicated by arrows in the image. Because of this ambiguity, the question is difficult to translate correctly in the absence of external knowledge and without relying on training data (inferring that “used for” implies a drug targeting a disease). (Color figure online)
Fig. 2
Simplified data model based on the Bgee database and QALD4 [17] datasets. The data model is a multigraph that includes disjoint properties, such as isAbsentIn and isExpressedIn, as well as inverse properties, such as possibleDiseaseTarget and possibleDrug. To make matters more complicated, a Side Effect and a Disease can be described by the same terms, with instances of the two classes being related via the sameAs property. As a result, even simple questions such as “which drugs might lead to strokes?” are hard to automatically translate correctly in the absence of external knowledge (i.e. “lead to” = “side effect”)
Fig. 3
Simplified answer pipeline for the query “What are the drugs for diseases associated with the BRCA genes?”. For the sake of simplicity, PageRank scores are solely displayed when more than one match is found
Fig. 4
Bio-SODA System Architecture
Fig. 5
Bio-SODA UX interface for knowledge graph exploration and query disambiguation. The three main components of the interface are: 1) an input field which also provides drop-downs with example candidate matches for each searched concept; 2) the fraction of the data model relevant to the question, shown in graph form; clicking on any node will display additional information in the “Details” box on the right; 3) the results table with options to extend with more attributes related to the concepts in the question
Fig. 6
Bio-SODA UX example use case for the question “drosophila anatomic entities at the embryo developmental stage”
Fig. 7
Bio-SODA UX example use case for the question “genes with lung in the description”
Fig. 8
Bio-SODA failure analysis. Out of the total 50 questions in the QALD4 biomedical benchmark, Bio-SODA cannot correctly answer 20. A further 12 out of 30 cannot be answered in the bioinformatics dataset, mainly due to query complexity (some queries have more than 10 triple patterns). Finally, on the CORDIS dataset, 10 out of 30 queries cannot be answered, a large fraction of which include features currently unsupported in Bio-SODA: aggregations, comparatives, conjunctions, etc.
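
The counts in the Fig. 8 caption can be turned into per-dataset shares of answered questions. The sketch below only performs that arithmetic; the dataset labels are shorthand introduced here, and the resulting share is not the F1-score reported in the abstract.

```python
# (total questions, questions that could not be answered correctly),
# taken from the Fig. 8 caption; the labels are shorthand used only here.
failures = {
    "QALD4 biomedical": (50, 20),
    "bioinformatics": (30, 12),
    "CORDIS": (30, 10),
}


def answered_fraction(total, failed):
    """Share of questions that were answered, given the failure count."""
    return (total - failed) / total


for name, (total, failed) in failures.items():
    share = answered_fraction(total, failed)
    print(f"{name}: {total - failed}/{total} answered ({share:.0%})")
```

This prints 30/50 (60%) for the QALD4 biomedical benchmark, 18/30 (60%) for the bioinformatics dataset, and 20/30 (67%) for CORDIS.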

References

    1. Diefenbach D, Both A, Singh K, Maret P. Towards a question answering system over the semantic web. Semantic Web Preprint. 2018;2018:1–19.
    2. Zheng W, Yu JX, Zou L, Cheng H. Question answering over knowledge graphs: question understanding via template decomposition. Proc. VLDB Endowm. 2018;11:1373–1386.
    3. Vakulenko S, Garcia JDF, Polleres A, de Rijke M, Cochez M. Message passing for complex question answering over knowledge graphs. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019. pp. 1431–1440.
    4. Li F, Jagadish HV. Constructing an interactive natural language interface for relational databases. Proc. VLDB Endowm. 2014;8:73–84. doi: 10.14778/2735461.2735468.
    5. Li F, Jagadish HV. Understanding natural language queries over relational databases. ACM SIGMOD Rec. 2016;45:6–13. doi: 10.1145/2949741.2949744.
