Review

Front Artif Intell. 2021 Oct 18;4:670009. doi: 10.3389/frai.2021.670009. eCollection 2021.

Assessing Open-Ended Human-Computer Collaboration Systems: Applying a Hallmarks Approach

Robyn Kozierok et al.

Abstract

There is a growing desire to create computer systems that can collaborate with humans on complex, open-ended activities. These activities typically have no set completion criteria and frequently involve multimodal communication, extensive world knowledge, creativity, and building structures or compositions through multiple steps. Because these systems differ from question and answer (Q&A) systems, chatbots, and simple task-oriented assistants, new methods for evaluating such collaborative computer systems are needed. Here, we present a set of criteria for evaluating these systems, called Hallmarks of Human-Machine Collaboration. The Hallmarks build on the success of heuristic evaluation used by the user interface community and past evaluation techniques used in the spoken language and chatbot communities. They consist of observable characteristics indicative of successful collaborative communication, grouped into eight high-level properties: robustness; habitability; mutual contribution of meaningful content; context-awareness; consistent human engagement; provision of rationale; use of elementary concepts to teach and learn new concepts; and successful collaboration. We present examples of how we used these Hallmarks in the DARPA Communicating with Computers (CwC) program to evaluate diverse activities, including story and music generation, interactive building with blocks, and exploration of molecular mechanisms in cancer. We used the Hallmarks as guides for developers and as diagnostics, assessing systems with the Hallmarks to identify strengths and opportunities for improvement using logs from user studies, surveying the human partner, third-party review of creative products, and direct tests. Informal feedback from CwC technology developers indicates that the use of the Hallmarks for program evaluation helped guide development. The Hallmarks also made it possible to identify areas of progress and major gaps in developing systems where the machine is an equal, creative partner.

Keywords: assessment; collaborative assistants; dialogue; evaluation; human-machine teaming; multimodal.
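As an illustration of the taxonomy the abstract describes, the eight high-level properties and the per-hallmark codes that appear in the figure captions (e.g. MC-4, CA-14, RO-1) could be represented as a small lookup table. The sketch below is hypothetical, not the authors' tooling; the prefix codes RO, HA, MC, CA, HE, and EC are inferred from the captions, while "PR" and "SC" are placeholder prefixes for the remaining two properties, not confirmed by this excerpt.

```python
# Sketch of the Hallmark taxonomy described in the abstract.
# Prefixes RO/HA/MC/CA/HE/EC are inferred from the figure captions;
# "PR" and "SC" are hypothetical placeholders for the remaining properties.
HALLMARK_PROPERTIES = {
    "RO": "robustness",
    "HA": "habitability",
    "MC": "mutual contribution of meaningful content",
    "CA": "context-awareness",
    "HE": "consistent human engagement",
    "PR": "provision of rationale",
    "EC": "use of elementary concepts to teach and learn new concepts",
    "SC": "successful collaboration",
}

def property_of(hallmark_code: str) -> str:
    """Map an individual hallmark code such as 'CA-14' to its high-level property."""
    prefix = hallmark_code.split("-")[0]
    return HALLMARK_PROPERTIES[prefix]

print(property_of("CA-14"))  # context-awareness
```

Such a mapping lets log annotations tagged with individual hallmark codes be rolled up to the eight properties when summarizing an evaluation.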


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Initiative, autonomy, and shaping: Machine partners (A) take initiative, (B) accept offered autonomy, and in (C) and (D) help shape the human partners’ utterances. As described in the text, these examples demonstrate Hallmarks MC-4, MC-5, HA-1, and HA-2. The text for panel (A) was provided by the Bob with Bioagents system. The text for panel (B) was provided by Paul Cohen (Paul Cohen, personal communication, 2017). The text for panel (C) was provided by the CLARE system. The text for panel (D) was provided by the MUSICA system.
FIGURE 2
Undue distraction: Early versions of two different CwC systems demonstrate negative examples of Hallmark HE-4: The machine communicates without creating undue distraction. (A) A distracting (hard to understand) machine-generated description of a constraint learned by an early version of the CABOT system (Perera et al., 2018). (B) A distracting (physically impossible) machine-generated visualization in an early version of the Diana system (Pustejovsky 2018).
FIGURE 3
Awareness of the evolving situation: DAVID is an agent that works with a human partner to manipulate blocks and answer spatial and temporal questions about the evolving blocks world scene. Here, the system demonstrates Context-awareness Hallmark CA-14: The machine responds appropriately to human references and actions in the context of the evolving situation (includes anything built and pieces available) by correctly answering questions about the locations of blocks both before and after the block moves shown. This dialogue takes place at the time of the third photo. The human partner’s utterances are spoken, and the log excerpt here shows the computer’s interpretation of the speech. The images and associated log text were provided by the DAVID system (Georgiy Platonov, unpublished data, 2020).
FIGURE 4
Multi-modal interaction with asynchrony: As Diana reaches for the white block, the human partner interrupts her by pointing to the blue one and asking her to grab it instead. (The human is pointing from his perspective at the place on the virtual table indicated by the purple outline.) As described in the text, this example demonstrates Hallmarks RO-9, CA-6, MC-2, HE-1, and HE-2. The image was provided by the Brandeis Diana team (Krishnaswamy and Pustejovsky, email to the authors, February 5, 2021).
FIGURE 5
Out-of-vocabulary (OOV) detection and management: As described in the text, these examples show a collaborative story writing system dealing with out-of-vocabulary terminology, demonstrating Hallmarks CA-4, CA-5, and CA-9. The examples were provided by the ISI human-computer collaborative storytelling system (Goldfarb-Tarrant et al., 2019; Yao et al., 2019).
FIGURE 6
Building new concepts from known concepts: When the TRIPS system encounters an unknown word, it can look the word up in a dictionary and, based on the definition, extend its ontology with a new concept, including axioms that relate the new concept to existing concepts, and build lexical entries for the word that enable sentences containing the new word to be understood. The figure shows how the word “ayete”, listed as a new word in the Oxford dictionary in 2019, is processed, adding a new concept that represents “ayete” in terms of the known concepts “perceive” and “understand”. This demonstrates Hallmark EC-3. The image was provided by IHMC (James Allen, email to the authors, February 5, 2021) to illustrate how this capability works.
FIGURE 7
Robustness assessment of a simple interaction in the Aesop system: On the left of the figure are human utterances (in quotation marks) paired with descriptions (in square brackets) of the state of the screen where the characters and scene are being built. In addition to the Hallmarks annotated in the image, this interaction demonstrates an overall success on Hallmark RO-1.
FIGURE 8
Sample chart generated by the authors from results of Participant Surveys from IHMC User Study (Ian Perera, unpublished data, July 2020) on CABOT Blocks World System (Perera et al., 2018). Survey items are shown along with their aligned Hallmarks and participant assessments. Eight individuals participated in this study.
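A chart like the one described above pairs each survey item with its aligned Hallmark and the participants' ratings. The snippet below is a hypothetical sketch of that aggregation step (the hallmark codes and ratings are invented for illustration, not data from the study):

```python
# Sketch (hypothetical data): aggregating participant survey responses by the
# Hallmark each survey item is aligned with. Eight ratings per item mirror the
# eight participants mentioned in the caption; values are invented.
from statistics import mean

# Each tuple: (aligned hallmark code, list of 1-5 Likert ratings).
survey_items = [
    ("MC-6", [5, 4, 4, 5, 3, 4, 5, 4]),
    ("CA-14", [3, 4, 2, 3, 4, 3, 3, 2]),
]

for hallmark, ratings in survey_items:
    print(f"{hallmark}: mean rating {mean(ratings):.2f} (n={len(ratings)})")
```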
FIGURE 9
(A) Final story generated by the Writing with Artificial Intelligence system, where turquoise shading represents system suggestions selected by the human partner and present in the final story, and underlining indicates word substitution by the human partner. (Hallmark MC-6: The machine makes meaningful contributions to the interaction.) (B) Chart of participant survey responses.
FIGURE 10
An example of coding a composition-by-conversation human-computer interaction log for robustness. Green blocks represent successful interactions between the human partner and the system. Red blocks indicate interactions where the machine did not respond appropriately. Hallmarks satisfied by each exchange, as well as Hallmarks that are not satisfied but represent potential opportunities for improvement, are indicated in the “Hallmark Assessment” column.
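The tallying step behind such a coded log can be sketched in a few lines. This is not the authors' analysis code; the log entries below are invented, and only the structure (success flag plus satisfied/opportunity hallmark codes) follows the figure description:

```python
# Sketch (hypothetical data): tallying a coded interaction log in which each
# exchange is marked successful (green) or not (red) and annotated with the
# Hallmarks it satisfies or misses.
from collections import Counter

coded_log = [
    {"success": True,  "satisfied": ["RO-1", "MC-6"], "opportunities": []},
    {"success": False, "satisfied": [],               "opportunities": ["CA-4"]},
    {"success": True,  "satisfied": ["RO-1"],         "opportunities": ["HE-4"]},
]

robustness_rate = sum(e["success"] for e in coded_log) / len(coded_log)
satisfied_counts = Counter(h for e in coded_log for h in e["satisfied"])

print(f"robustness: {robustness_rate:.0%}")   # robustness: 67%
print(satisfied_counts.most_common(1))        # [('RO-1', 2)]
```

Summaries like the fraction of green exchanges and the most frequently satisfied Hallmarks are exactly the kind of diagnostics the Hallmarks approach is meant to surface from user-study logs.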
FIGURE 11
Hallmark successes and opportunities in a researcher session with Bob. Each block outlined in black marks a single utterance or set of utterances aimed at obtaining a particular kind of information or achieving a particular action. Colors indicate the assessment of each Bob response: dark green represents an appropriate biological answer or action; light green represents a helpful suggestion; yellow represents an unhelpful or misleading response for which the human partner later found a way to obtain what they were looking for; red represents an unhelpful or misleading response to a query for which the human partner never obtained an answer. Bubbles show selected single exchanges from the session.
