Int J Comput Vis. 2021 Apr;129(4):942-959. doi: 10.1007/s11263-020-01404-0. Epub 2021 Jan 5.

Adding Knowledge to Unsupervised Algorithms for the Recognition of Intent

Stuart Synakowski et al. Int J Comput Vis. 2021 Apr.

Abstract

The performance of computer vision algorithms is near or superior to that of humans on visual problems including object recognition (especially of fine-grained categories), segmentation, and 3D object reconstruction from 2D views. Humans are, however, capable of higher-level image analyses. A clear example, involving theory of mind, is our ability to determine whether a perceived behavior or action was performed intentionally or not. In this paper, we derive an algorithm that can infer whether the behavior of an agent in a scene is intentional or unintentional based on its 3D kinematics, using knowledge of self-propelled motion, Newtonian motion, and their relationship. We show how the addition of this basic knowledge leads to a simple, unsupervised algorithm. To test the derived algorithm, we constructed three dedicated datasets, ranging from abstract geometric animations to realistic videos of agents performing intentional and non-intentional actions. Experiments on these datasets show that our algorithm can recognize whether an action is intentional or not, even without training data. Quantitatively, its performance is comparable to that of various supervised baselines, and qualitatively it produces sensible intentionality segmentations.
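As one compact reading of the physical cue behind the method (the symbols E(t), m, g, and δt below are ours; ∆E(t) and p(t) follow the figure captions): given the agent's 3D trajectory p(t) and mass m, the total mechanical energy and its change over a small step δt can be written as

    E(t) = \tfrac{1}{2}\, m\, \lVert \dot{p}(t) \rVert^{2} + m\, g\, p_y(t),
    \qquad \Delta E(t) = E(t) - E(t - \delta t)

Under passive Newtonian dynamics (gravity, friction, collisions), E(t) cannot increase, so a sustained ∆E(t) > 0 signals energy injected by the agent, i.e., self-propelled and hence potentially intentional motion.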

Keywords: Action Recognition; Commonsense; Intent; Theory of Mind; Unsupervised.


Figures

Fig. 1
Recognizing intentional versus non-intentional actions. The six samples are from the three datasets introduced in Section 5.1. The colored horizontal bar underneath each image sequence denotes the intentionality function I(t) of the action. The yellow crosshair illustrates the 2D projection of the location of the agent's center of mass. (a) intent-maya dataset. The intentional action shows a ball agent jumping down from a platform and climbing up a conveyor belt. In the non-intentional action, the ball moves according to Newtonian physics. The transparent tail of the ball shows the location of the agent during the last second. (b) intent-mocap dataset. In the intentional action, the agent jumps down from an (invisible) platform. In the non-intentional action, the agent trips while walking. These animation snapshots are extracted directly from www.mixamo.com. (c) intent-youtube dataset. In the intentional action, the agent successfully completes a board slide. In the non-intentional action, the agent falls at the end of an ollie.
Fig. 2
Overview of the proposed algorithm. Here we illustrate the concepts we derive to model intentionality. (a) shows a logic diagram of the four concepts introduced in Section 3.2 and their relationship with intentionality. (b)-(e) show a pair of samples from our dataset described in Section 5.1.1. The intentional example (in the blue box) shows a ball stepping down a ladder and jumping down an inclined platform to reach the isle at the far end of the scene. In the non-intentional example, the ball rolls and bounces according to Newtonian physics, with a trajectory that closely mimics that of the intentional action; yet the human eye is not tricked by this, and people clearly classify the first action as intentional and the second as non-intentional. (b) Result of our algorithm when only Concept 1 is considered; (c) result with Concepts 1 and 2; (d) result with Concepts 1, 2 and 3; (e) result with all four concepts included in our algorithm; (f) model overview. The proposed algorithm first extracts the change in total mechanical energy ∆E(t) and the vertical acceleration ay(t) from the input trajectory of the agent, p(t). Concept 1 recognizes intentional actions from ∆E(t). Concept 2 takes ay(t) and the output of Concept 1 to form an understanding of non-intentional actions, which is then used in Concept 3 to update the decision. Finally, Concept 4 handles all the unknown states that were previously unrecognizable (see the derivation in the main text of the paper for details).
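The pipeline in panel (f) could be organized in code roughly as follows. This is a minimal sketch under our own assumptions: the thresholds, the choice of index 1 as the vertical (y-up) axis, and the Concept 4 fallback of inheriting the last decided label are placeholders, not the paper's exact rules.

    import numpy as np

    G = 9.81  # gravitational acceleration (m/s^2)

    def intentionality(p, dt, m=1.0, e_thresh=0.05, a_thresh=0.5):
        """Frame-wise intentionality labels from a 3D trajectory p of shape (T, 3).

        Assumes the second coordinate (index 1) is the vertical axis. Returns an
        array with +1 for intentional and -1 for non-intentional per frame.
        Thresholds are illustrative placeholders, not the paper's values.
        """
        v = np.gradient(p, dt, axis=0)                           # velocity estimate
        a = np.gradient(v, dt, axis=0)                           # acceleration estimate
        E = 0.5 * m * np.sum(v ** 2, axis=1) + m * G * p[:, 1]   # kinetic + potential energy
        dE = np.gradient(E, dt)                                  # change in mechanical energy

        # Concept 1: energy injected into the system -> intentional movement.
        ic1 = dE > e_thresh
        # Concept 2: vertical acceleration close to -g with no energy injection -> free fall.
        free_fall = (np.abs(a[:, 1] + G) < a_thresh) & ~ic1
        # Concept 3: free fall caused by an intentional action (e.g. a jump)
        # inherits the intentional label from the frame that launched it.
        ic3 = ic1.copy()
        for t in range(1, len(p)):
            if free_fall[t] and ic3[t - 1]:
                ic3[t] = True
        # Concept 4: frames left undecided inherit the last decided label
        # (a simplifying fallback chosen for this sketch).
        labels = np.where(ic3, 1, np.where(free_fall, -1, 0))
        for t in range(1, len(p)):
            if labels[t] == 0:
                labels[t] = labels[t - 1]
        labels[labels == 0] = -1                                 # leading undecided frames
        return labels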
Fig. 3
An example where Concept 3 is necessary to achieve a correct classification of intentionality. In this trajectory, an agent jumps twice. IC1, IC2 and IC3 are the outputs of Concepts 1, 2 and 3, respectively. At times ta1 and ta2, the agent adds positive energy into the system to initiate the jumps. Thus, the movement at these two time points is detected as intentional, as shown in IC1. s1 and s2 are the two time intervals during which the agent's movement is induced only by gravity, i.e., free fall. Hence, the action in these two intervals is detected as non-intentional by IC2. However, since the free fall is part of the jump, the correct classification should be intentional. By taking into account the causal relationship between actions, Concept 3 correctly classifies these two movements as intentional, as shown in IC3.
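As a concrete illustration of this case, one could feed the sketch after Fig. 2 a synthetic trajectory containing two ballistic jump arcs; under the assumptions stated there, the free-fall frames inherit the intentional label from the launch frames that caused them. The timings and launch velocity below are arbitrary choices for the example.

    import numpy as np

    dt = 1.0 / 30.0                                  # 30 fps, illustrative
    t = np.arange(0.0, 2.0, dt)
    y = np.zeros_like(t)
    for start in (0.3, 1.2):                         # two jumps launched at 0.3 s and 1.2 s
        arc = (t >= start) & (t < start + 0.61)      # flight time of ~0.61 s for v0 = 3 m/s
        tau = t[arc] - start
        y[arc] = 3.0 * tau - 0.5 * 9.81 * tau ** 2   # ballistic arc: launch, then free fall
    p = np.stack([t, np.maximum(y, 0.0), np.zeros_like(t)], axis=1)  # (x, y, z) with y up

    labels = intentionality(p, dt)                   # sketch defined after Fig. 2
    # Under these assumptions, the launch frames (energy injection) and the
    # free-fall frames of each arc come out labelled intentional.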
Fig. 4
Illustration of the weight distribution used to calculate the center of mass in the mocap and youtube datasets. The joints with solid color are used in both the mocap and youtube skeleton templates. The joints with a diagonal pattern are used only in the mocap skeleton template.
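A minimal sketch of how such a weighted center of mass could be computed; the joint names and weights below are illustrative placeholders, not the distribution used in the paper.

    import numpy as np

    def center_of_mass(joints, weights):
        """Weighted center of mass of a skeleton at one frame.

        joints  : dict mapping joint name -> (x, y, z) position
        weights : dict mapping joint name -> relative mass weight; only joints
                  present in both dicts contribute, mirroring the shared subset
                  of the mocap and youtube skeleton templates.
        """
        names = [n for n in weights if n in joints]
        w = np.array([weights[n] for n in names], dtype=float)
        x = np.array([joints[n] for n in names], dtype=float)
        return (w[:, None] * x).sum(axis=0) / w.sum()

    # Illustrative weights only (not the paper's distribution):
    example_weights = {"hip": 0.3, "spine": 0.2, "head": 0.1, "l_knee": 0.1,
                       "r_knee": 0.1, "l_shoulder": 0.1, "r_shoulder": 0.1}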
Fig. 5
Qualitative results of our algorithm on the intent-maya dataset. The full model with all concepts is used. Blue (red) indicates that our algorithm recognizes the movement of the agent as intentional (non-intentional) at that specific time. The ground-truth annotation is shown on the left of the figure.
Fig. 6
Qualitative results of our algorithm on the intent-mocap dataset. All samples shown here contain intentional actions. The full model with all concepts is used. The colorbar indicates the intentionality judgement by our algorithm at each frame, blue for intentional and red for non-intentional. The number above the agent is the corresponding frame index in the sequence. The action name, shown in the top-left corner of each sequence, corresponds to the animation name in the mixamo dataset. We applied a median filter with window size 30 to IC4 to smooth the result.
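The smoothing step can be reproduced with a standard median filter; note that scipy.signal.medfilt requires an odd kernel size, so 29 or 31 frames stands in for the reported window of 30. This is a sketch, not the authors' post-processing code.

    import numpy as np
    from scipy.signal import medfilt

    def smooth_labels(ic4, window=31):
        """Median-filter the frame-wise intentionality output IC4.

        medfilt requires an odd kernel size, so 31 (or 29) stands in for the
        30-frame window reported in the caption.
        """
        return medfilt(np.asarray(ic4, dtype=float), kernel_size=window)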
Fig. 7
Qualitative results of our algorithm on the intent-mocap dataset. All samples shown here contain non-intentional actions. The full model with all concepts is used. The same method used in Figure 6 was applied to generate these images.
Fig. 8
Qualitative results of our algorithm on the intent-youtube dataset. Each sequence contains 10 frames uniformly sampled across time. The colorbar depicts the intentionality judgement by our algorithm at each frame. A median filter with a window size of 30 frames is also applied for visual presentation.
