Sensors (Basel). 2022 Oct 29;22(21):8318.
doi: 10.3390/s22218318.

Capturing Conversational Gestures for Embodied Conversational Agents Using an Optimized Kanade-Lucas-Tomasi Tracker and Denavit-Hartenberg-Based Kinematic Model


Grega Močnik et al. Sensors (Basel). 2022.

Abstract

To recreate viable and human-like conversational responses, an artificial entity, i.e., an embodied conversational agent, must express correlated speech (verbal) and gesture (non-verbal) responses in spoken social interaction. Most existing frameworks focus on intent planning and behavior planning, while realization is left to a limited set of static 3D representations of conversational expressions. In addition to functional and semantic synchrony between verbal and non-verbal signals, the final believability of a displayed expression is shaped by the physical realization of its non-verbal part. A major challenge for most conversational systems capable of reproducing gestures is diversity of expressiveness. In this paper, we propose a method for capturing gestures automatically from video and transforming them into 3D representations stored as part of the conversational agent's repository of motor skills. The main advantage of the proposed method is that it preserves the naturalness of the embodied conversational agent's gestures, which results in higher-quality human-computer interaction. The method is based on a Kanade-Lucas-Tomasi tracker, a Savitzky-Golay filter, a Denavit-Hartenberg-based kinematic model and the EVA framework. Furthermore, we designed an objective evaluation based on cosine similarity in place of a subjective evaluation of the synthesized movement. The proposed method achieved a similarity of 96%.
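The objective evaluation described above scores synthesized against captured motion with cosine similarity. A minimal sketch of that metric, assuming the motion is represented as flattened vectors of joint-angle samples (the trajectory values below are hypothetical, not data from the paper):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened motion trajectories."""
    a = np.asarray(a, float).ravel()
    b = np.asarray(b, float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical joint-angle trajectories (radians) for the same gesture:
# the captured reference vs. the version synthesized on the agent.
reference = [0.10, 0.25, 0.40, 0.55, 0.60]
synthesized = [0.12, 0.24, 0.41, 0.52, 0.63]
score = cosine_similarity(reference, synthesized)
```

A score near 1 indicates that the synthesized trajectory closely follows the captured one in direction, independent of overall amplitude scaling.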

Keywords: 3D gestures; Denavit–Hartenberg; Kanade–Lucas–Tomasi tracker; conversational gestures; embodied conversational agents; gesture reconstruction; kinematics; motor skills.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Workflow to capture conversational behavior in spontaneous discourse automatically, store it as gesture templates, and interconnect the captured templates with other verbal and nonverbal features of the observed sequence. The hand shape is not tracked; a CNN model was used to select the shape from a dictionary of possible shapes based on the HamNoSys notation system [48].
Figure 2
Example of (a) good features extracted by the Shi-Tomasi detector and (b) tracking points selected as the strongest corners in a specific region, representing the "tracked" joints in the human skeleton.
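The "good features" in (a) follow the Shi-Tomasi criterion: a pixel is a good corner when the smaller eigenvalue of its local structure tensor is large. A minimal NumPy/SciPy sketch of that response (the synthetic square image and the window size are illustrative, not from the paper):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def shi_tomasi_response(img, win=5):
    """Shi-Tomasi corner response: the smaller eigenvalue of the
    local structure tensor, averaged over a win x win window."""
    img = np.asarray(img, float)
    Iy, Ix = np.gradient(img)               # image gradients
    Sxx = uniform_filter(Ix * Ix, win)      # structure-tensor entries
    Syy = uniform_filter(Iy * Iy, win)
    Sxy = uniform_filter(Ix * Iy, win)
    half_tr = 0.5 * (Sxx + Syy)
    root = np.sqrt((0.5 * (Sxx - Syy)) ** 2 + Sxy ** 2)
    return half_tr - root                   # smaller eigenvalue

# Synthetic frame: a bright square whose corners should score highest
# (edges score near zero, since one eigenvalue vanishes there).
frame = np.zeros((40, 40))
frame[10:30, 10:30] = 1.0
R = shi_tomasi_response(frame)
y, x = np.unravel_index(np.argmax(R), R.shape)
```

In practice this detector is typically invoked through OpenCV's `goodFeaturesToTrack`, which adds quality thresholding and minimum-distance suppression on top of the same response.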
Figure 3
Overview of the implementation of the pyramidal KLT Tracker.
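At each pyramid level, the KLT tracker solves a small 2x2 linear system G d = b for the displacement d of a patch around the tracked point. A self-contained sketch of one such step (the window size, the synthetic Gaussian blob, and the 1-pixel shift are illustrative assumptions):

```python
import numpy as np

def lk_step(prev, curr, pt, win=7):
    """One Lucas-Kanade step: estimate the displacement (dx, dy) of the
    patch around pt by solving G d = b, with G the spatial-gradient
    matrix and b the image-mismatch vector (the core computation
    repeated at every level of the KLT pyramid)."""
    prev = np.asarray(prev, float)
    curr = np.asarray(curr, float)
    y, x = pt
    sl = (slice(y - win, y + win + 1), slice(x - win, x + win + 1))
    Iy, Ix = np.gradient(prev)      # spatial gradients
    It = curr - prev                # temporal difference
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    G = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    dx, dy = np.linalg.solve(G, b)
    return dx, dy

# Synthetic frame pair: a Gaussian blob translated 1 px to the right.
Y, X = np.mgrid[0:40, 0:40]
prev = np.exp(-((X - 20.0) ** 2 + (Y - 20.0) ** 2) / 18.0)
curr = np.exp(-((X - 21.0) ** 2 + (Y - 20.0) ** 2) / 18.0)
dx, dy = lk_step(prev, curr, (20, 20))
```

The pyramidal variant runs this step coarse-to-fine, so that large motions become sub-pixel residuals at each finer level; OpenCV exposes the full pipeline as `calcOpticalFlowPyrLK`.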
Figure 4
Power spectrum analysis of a filtered and nonfiltered signal. The green curve represents the optimal filtered signal; the orange curve represents the nonfiltered signal.
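A comparison like Figure 4 can be reproduced with a one-sided FFT power spectrum. A small sketch, where the 30 fps sampling rate, the 1.5 Hz sinusoid and the noise level are illustrative assumptions rather than the paper's signals:

```python
import numpy as np

fs = 30.0                           # assumed sampling rate (frames per second)
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(1)
# Stand-in for an unfiltered tracking signal: slow gesture motion plus jitter.
sig = np.sin(2 * np.pi * 1.5 * t) + 0.3 * rng.standard_normal(t.size)

power = np.abs(np.fft.rfft(sig)) ** 2 / sig.size   # one-sided power spectrum
freqs = np.fft.rfftfreq(sig.size, 1 / fs)
peak_hz = freqs[np.argmax(power[1:]) + 1]          # dominant non-DC component
```

Plotting `power` against `freqs` for the raw and filtered versions of the same track shows where a filter attenuates the high-frequency jitter while preserving the gesture's dominant low-frequency content.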
Figure 5
Smoothing the raw tracking results with an SG filter.
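Smoothing as in Figure 5 corresponds to SciPy's `savgol_filter`; the window length and polynomial order below are illustrative placeholders, not the paper's tuned values:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
clean = np.sin(t)                          # idealized joint trajectory
raw = clean + rng.normal(0, 0.1, t.size)   # noisy KLT-style track

# Savitzky-Golay: fit a low-order polynomial in a sliding window,
# which smooths jitter while preserving peaks better than a plain
# moving average.
smooth = savgol_filter(raw, window_length=21, polyorder=3)
```

A larger `window_length` removes more jitter but risks flattening fast gesture strokes, so the window is typically chosen against the power spectrum of the raw track.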
Figure 6
Visualization of the kinematic model, i.e., the arm manipulator consisting of complex cylindrical joints, implementing degrees of freedom of a spherical joint utilized in the skeleton of the realization entity.
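Each joint of such a kinematic model contributes one homogeneous transform determined by its four Denavit-Hartenberg parameters, and chaining the transforms gives the arm's forward kinematics. A sketch using the standard DH convention (the 2-link planar arm is a hypothetical example, not the paper's arm model):

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one link in the standard
    Denavit-Hartenberg convention: joint angle theta, link offset d,
    link length a, link twist alpha."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

# Hypothetical 2-link planar arm (unit links, alpha = d = 0),
# both joints rotated 30 degrees; chain the per-link transforms.
q = np.deg2rad(30.0)
T = dh_transform(q, 0.0, 1.0, 0.0) @ dh_transform(q, 0.0, 1.0, 0.0)
end_effector = T[:2, 3]     # planar (x, y) of the wrist
```

A spherical joint, as in the figure, is modeled by stacking several such single-axis transforms at the same physical location, one per degree of freedom.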
Figure 7
An example of procedural animation formulated in EVAScriptMarkup. Each <sequence><parallel> pair represents configurations Pi and Pi+1, i.e., the transition between two consecutive frames adjusted to the frame-rate scaling. durationUp represents the duration of the transition and is calculated from sizeofHt; the value represents the 3D configuration of the "joint" (movement controller) in Euler angles expressed in roll–pitch–yaw (HPR) notation.
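Each HPR value in such markup can be turned into a rotation matrix by composing per-axis rotations. The Z-Y-X composition below is one common roll-pitch-yaw convention and may differ from the axis order the EVA framework actually uses:

```python
import numpy as np

def rpy_to_matrix(roll, pitch, yaw):
    """Rotation matrix from roll-pitch-yaw Euler angles, composed in
    Z-Y-X order (yaw about z, then pitch about y, then roll about x).
    NOTE: the axis order is an assumption; HPR conventions vary."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

# A yaw of 90 degrees maps the x-axis onto the y-axis:
R = rpy_to_matrix(0.0, 0.0, np.pi / 2)
```

Whatever the convention, the result must be orthonormal, which is a useful sanity check when interpolating between consecutive Pi configurations.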
Figure 8
The red and blue curves represent the third derivative (jerk) of the reconstructed position signal from EVAPose and OpenPose, respectively, shown as a function of the number of samples. The number and amplitude of jerks in this type of analysis reveal unnatural, high-energy spikes in the reconstructed signal. Only a section of the entire signal (120 of 1380 samples) is shown for a clearer presentation of the reconstructed signal's third derivative.
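The jerk analysis in Figure 8 amounts to differentiating the position signal three times and inspecting the spikes. A sketch using repeated `np.gradient` calls (the 30 fps rate and the injected single-sample glitch are illustrative assumptions):

```python
import numpy as np

def jerk(position, dt):
    """Third time derivative of a position signal via repeated
    central differences."""
    v = np.gradient(position, dt)   # velocity
    a = np.gradient(v, dt)          # acceleration
    return np.gradient(a, dt)       # jerk

dt = 1 / 30.0                       # assumed 30 fps capture rate
t = np.arange(0.0, 2.0, dt)
smooth = np.sin(2 * np.pi * t)      # stand-in for a clean trajectory
glitchy = smooth.copy()
glitchy[30] += 0.2                  # single-sample reconstruction glitch
```

Even a small positional glitch produces a jerk spike far above the baseline of the smooth motion, which is why the paper uses jerk rather than raw position to expose reconstruction artifacts.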

References

    1. Trujillo J.P., Simanova I., Bekkering H., Özyürek A. Communicative intent modulates production and comprehension of actions and gestures: A Kinect study. Cognition. 2018;180:38–51. doi: 10.1016/j.cognition.2018.04.003. - DOI - PubMed
    2. Kelly S.D., Özyürek A., Maris E. Two Sides of the Same Coin: Speech and Gesture Mutually Interact to Enhance Comprehension. Psychol. Sci. 2010;21:260–267. doi: 10.1177/0956797609357327. - DOI - PubMed
    3. Cassell J. Embodied Conversational Agents: Representation and Intelligence in User Interfaces. AI Mag. 2001;22:67. doi: 10.1609/aimag.v22i4.1593. - DOI
    4. Birdwhistell R.L. Kinesics and Context: Essays on Body Motion Communication. University of Pennsylvania Press; Philadelphia, PA, USA: 2010.
    5. ter Stal S., Kramer L.L., Tabak M., op den Akker H., Hermens H. Design Features of Embodied Conversational Agents in eHealth: A Literature Review. Int. J. Hum.-Comput. Stud. 2020;138:102409. doi: 10.1016/j.ijhcs.2020.102409. - DOI
