Sci Rep. 2021 Sep 16;11(1):18462.
doi: 10.1038/s41598-021-98019-3.

Translating synthetic natural language to database queries with a polyglot deep learning framework

Adrián Bazaga et al. Sci Rep.

Abstract

The number of databases, as well as their size and complexity, is increasing. This creates a barrier to use, especially for non-experts, who have to come to grips with the nature of the data, the way it has been represented in the database, and the specific query languages or user interfaces by which data are accessed. These difficulties worsen in research settings, where it is common to work with many different databases. One approach to improving this situation is to allow users to pose their queries in natural language. In this work we describe a machine learning framework, Polyglotter, that supports the mapping of natural language searches to database queries in a general way. Importantly, it does not require the creation of manually annotated data for training and can therefore be applied easily to multiple domains. The framework is polyglot in the sense that it supports multiple database engines that are accessed with a variety of query languages, including SQL and Cypher. Furthermore, Polyglotter supports multi-class queries. Good performance is achieved on both toy and real databases, as well as on a human-annotated WikiSQL query set. Thus, Polyglotter may help database maintainers make their resources more accessible.
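To make the "polyglot" idea concrete, the sketch below renders one structured query as both SQL and Cypher. It is an illustration only, not Polyglotter's method: the abstract describes sequence-to-sequence models that map natural language to queries, whereas this example uses a hand-written intermediate representation. The `Query` and `Constraint` types, the `Gene` class and its attributes are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    attribute: str
    operator: str  # e.g. "=", "<", ">"
    value: object


@dataclass
class Query:
    cls: str                 # hypothetical single class (table / node label)
    attributes: list         # attributes to return
    constraints: list = field(default_factory=list)


def _literal(value):
    # Illustration only: assumes string values contain no quote characters.
    if isinstance(value, str):
        return "'" + value + "'"
    return str(value)


def to_sql(q):
    """Render the query as a SQL SELECT statement."""
    cols = ", ".join(f"{q.cls}.{a}" for a in q.attributes)
    sql = f"SELECT {cols} FROM {q.cls}"
    if q.constraints:
        conds = " AND ".join(
            f"{q.cls}.{c.attribute} {c.operator} {_literal(c.value)}"
            for c in q.constraints
        )
        sql += f" WHERE {conds}"
    return sql


def to_cypher(q):
    """Render the same query as a Cypher MATCH ... RETURN statement."""
    var = q.cls[0].lower()
    cypher = f"MATCH ({var}:{q.cls})"
    if q.constraints:
        conds = " AND ".join(
            f"{var}.{c.attribute} {c.operator} {_literal(c.value)}"
            for c in q.constraints
        )
        cypher += f" WHERE {conds}"
    cypher += " RETURN " + ", ".join(f"{var}.{a}" for a in q.attributes)
    return cypher


query = Query("Gene", ["symbol", "length"], [Constraint("length", ">", 1000)])
print(to_sql(query))
print(to_cypher(query))
```

The same question ("genes longer than 1000") therefore has one target form per engine, which is why a framework of this kind needs a separate output language for each database it supports.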

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Overview of the random query generation method used by Polyglotter to obtain training datasets for the sequence-to-sequence models. Classes and attributes are added through a random walk, and constraint operators and values are also selected randomly.
Figure 2
Graphical summary of the overall workflow for Polyglotter.
Figure 3
Test set performance as a function of training dataset size for HumanMine, MySQL and Neo4j (rows 1, 2 and 3 respectively). Left: overall test set performance allowing any of the top 1, 3 or 5 predictions from each model to match the corresponding test. Right: test set performance separately for each of the elements forming a query (attributes, classes and constraints) when using just the top prediction. The error bars show the standard deviation across ten independent datasets.
Figure 4
HumanMine performance as a function of the number of classes in the query, as the number of training items varies, with k=1. Note that queries containing five classes were absent from the training set.
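The random query generation summarised in the Figure 1 caption (classes and attributes added through a random walk, constraint operators and values chosen at random) could look roughly like the sketch below. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy `SCHEMA`, the `OPERATORS` list, the `random_query` function and the use of random integers as constraint values are all hypothetical.

```python
import random

# Hypothetical toy schema: each class lists its attributes and linked classes.
SCHEMA = {
    "Gene": {"attributes": ["symbol", "length"], "links": ["Protein", "Pathway"]},
    "Protein": {"attributes": ["name", "mass"], "links": ["Gene"]},
    "Pathway": {"attributes": ["name"], "links": ["Gene"]},
}
OPERATORS = ["=", "<", ">"]  # assumed operator set


def random_query(rng, max_classes=3):
    """Build one random query by walking the schema graph."""
    current = rng.choice(sorted(SCHEMA))
    classes = [current]
    target = rng.randint(1, max_classes)

    # Random walk: step to a linked class that has not been visited yet.
    while len(classes) < target:
        unvisited = [c for c in SCHEMA[current]["links"] if c not in classes]
        if not unvisited:
            break  # dead end: stop with fewer classes than the target
        current = rng.choice(unvisited)
        classes.append(current)

    # Pick one attribute at random from every visited class.
    attributes = [(c, rng.choice(SCHEMA[c]["attributes"])) for c in classes]

    # Pick a constrained attribute, an operator and a value at random.
    # Random integers stand in for values sampled from a real database.
    c_cls, c_attr = rng.choice(attributes)
    constraint = (c_cls, c_attr, rng.choice(OPERATORS), rng.randint(1, 100))

    return {"classes": classes, "attributes": attributes, "constraint": constraint}


rng = random.Random(0)  # fixed seed so repeated runs give the same queries
for _ in range(3):
    print(random_query(rng))
```

Each generated structure can then be rendered both as a query and as a synthetic natural language sentence, giving paired training examples without manual annotation.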

