Artif Intell Rev. 2021;54(1):755-810.
doi: 10.1007/s10462-020-09866-x. Epub 2020 Jun 25.

Survey on evaluation methods for dialogue systems

Jan Deriu et al. Artif Intell Rev. 2021.

Abstract

In this paper, we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part of the development process: dialogue systems are often evaluated by means of human judgements and questionnaires, which tend to be very cost- and time-intensive. Thus, much work has been put into finding methods that reduce the amount of human labour involved. In this survey, we present the main concepts and methods. To this end, we differentiate between the various classes of dialogue systems (task-oriented, conversational, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for its dialogue systems and then presenting the evaluation methods for that class.

Keywords: Chatbots; Conversational AI; Dialogue systems; Discourse model; Evaluation metrics.


Conflict of interest statement

Conflict of interest: There are no conflicts of interest to disclose.

Figures

Fig. 1
Example dialogue where the driver can query the agenda via a voice command (Eric et al. 2017). The dialogue system guides the driver through the various options.
Fig. 2
General overview of a task-oriented dialogue system.
Fig. 3
Overview of a DST module. The input to the DST module is the combined output of the ASR and the NLU model.
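The DST module described above maintains a belief over slot values from the noisy hypotheses produced upstream. As a minimal, hypothetical sketch (not the survey's own formulation), a tracker can accumulate per-slot confidences from NLU output and renormalise them into a belief state; all names and the update rule here are illustrative assumptions:

```python
def update_belief(belief, nlu_hypotheses):
    """Merge one turn's scored NLU hypotheses into the running belief.

    belief: {slot: {value: probability}}
    nlu_hypotheses: list of (slot, value, confidence) triples, as an NLU
    module might emit from ASR output.
    """
    for slot, value, conf in nlu_hypotheses:
        slot_belief = belief.setdefault(slot, {})
        # Simple accumulate-and-renormalise update.
        slot_belief[value] = slot_belief.get(value, 0.0) + conf
        total = sum(slot_belief.values())
        for v in slot_belief:
            slot_belief[v] /= total
    return belief

def top_hypothesis(belief, slot):
    """Return the most probable value for a slot, or None if unseen."""
    if not belief.get(slot):
        return None
    return max(belief[slot], key=belief[slot].get)

belief = {}
update_belief(belief, [("food", "italian", 0.7), ("food", "indian", 0.3)])
update_belief(belief, [("food", "italian", 0.9), ("area", "centre", 0.8)])
print(top_hypothesis(belief, "food"))   # italian
print(top_hypothesis(belief, "area"))   # centre
```

Real trackers (statistical or neural) learn this update from data; the dictionary-based version only illustrates the data flow of Fig. 3.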
Fig. 4
Examples of goals from Schatzmann et al. (2007) and Walker et al. (1997). C0 denotes the information constraints, i.e. which information is to be retrieved (a bar that serves beer in the city center); R0 denotes the set of requests, i.e. the information the user wants (name, address, and phone number).
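A user goal of this kind is naturally represented as two pieces: the constraints C0 used to pick an entity, and the requests R0 listing the fields the user wants back. The following sketch uses made-up venue data purely to illustrate that structure; it is an assumption, not the encoding used by the cited papers:

```python
# C0: slots the retrieved entity must satisfy; R0: slots to report back.
goal = {
    "constraints": {"type": "bar", "drinks": "beer", "area": "city centre"},
    "requests": ["name", "address", "phone number"],
}

def matches(venue, constraints):
    """An entity satisfies the goal iff every constraint slot agrees."""
    return all(venue.get(slot) == value for slot, value in constraints.items())

# Hypothetical database entry.
venue = {"name": "The Anchor", "type": "bar", "drinks": "beer",
         "area": "city centre", "address": "1 High St",
         "phone number": "555-0100"}

if matches(venue, goal["constraints"]):
    # Answer the requests R0 from the matching entity.
    answer = {slot: venue[slot] for slot in goal["requests"]}
    print(answer)
```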
Fig. 5
PARADISE overview (Schmitt and Ultes 2015).
Fig. 6
Overview of the interaction quality procedure (Schmitt and Ultes 2015).
Fig. 7
Overview of the HRED architecture. There are two levels of encoding: (i) the utterance encoder, which encodes a single utterance, and (ii) the context encoder, which encodes the sequence of utterance encodings. The decoder is conditioned on the context encoding.
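The two-level encoding in the HRED caption can be sketched very compactly. Real HREDs use learned GRU/LSTM encoders and a trained decoder; here both levels are tiny tanh RNNs with fixed random weights, purely to show the data flow (token vectors → utterance encoding → context encoding over the utterance sequence). All sizes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding / hidden size (arbitrary choice for the sketch)
W_in = rng.normal(scale=0.1, size=(d, d))
W_hh = rng.normal(scale=0.1, size=(d, d))

def rnn_encode(vectors):
    """Run a single-layer tanh RNN over a sequence; return the last state."""
    h = np.zeros(d)
    for x in vectors:
        h = np.tanh(W_in @ x + W_hh @ h)
    return h

def embed(token):
    """Deterministic stand-in for a learned embedding table."""
    return np.random.default_rng(sum(token.encode())).normal(size=d)

dialogue = [["hello", "there"], ["book", "a", "table"], ["for", "two"]]

# Level 1: encode each utterance from its token embeddings.
utterance_encodings = [rnn_encode([embed(t) for t in utt]) for utt in dialogue]
# Level 2: encode the sequence of utterance encodings into a context vector.
context = rnn_encode(utterance_encodings)
# A decoder would now be conditioned on `context` to generate the next turn.
print(context.shape)  # (8,)
```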

