Auton Robots. 2023;47(2):249-265. doi: 10.1007/s10514-022-10074-5. Epub 2022 Dec 3.

That was not what I was aiming at! Differentiating human intent and outcome in a physically dynamic throwing task


Vidullan Surendran et al. Auton Robots. 2023.

Abstract

Recognising intent in collaborative human-robot tasks can improve team performance and human perception of robots. Intent can differ from the observed outcome in the presence of mistakes, which are likely in physically dynamic tasks. We created a dataset of 1227 throws of a ball at a target from 10 participants and observed that 47% of throws were mistakes, with 16% completely missing the target. Our approach leverages facial images capturing the person's reaction to the outcome of a throw to predict when the resulting throw is a mistake, and then determines the actual intent of the throw. The approach we propose for outcome prediction performs 38% better than the two-stream architecture previously used for this task on front-on videos. In addition, we propose a 1D-CNN model that is used in conjunction with priors learned from the frequency of mistakes to provide an end-to-end pipeline for outcome and intent recognition in this throwing task.
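
The pipeline described above combines a facial-reaction-based mistake prediction with priors learned from how often mistakes occur. As a minimal illustration of that combination step (not the published implementation; the function name, zone indexing, and example prior are assumptions), a Bayesian mixture over intended zones could look like this in Python:

    import numpy as np

    N_ZONES = 9  # 3x3 target grid; throws that miss the grid entirely are omitted for brevity

    def infer_intent(outcome_zone, p_mistake, prior_intent_given_miss):
        """Estimate a distribution over intended zones from an observed outcome.

        outcome_zone:            index 0-8 of the zone the ball actually hit
        p_mistake:               P(throw was a mistake | facial reaction), scalar in [0, 1]
        prior_intent_given_miss: (9, 9) array; row i gives P(intent = j | outcome = i, mistake),
                                 e.g. estimated from mistake frequencies in the dataset
        """
        posterior = np.zeros(N_ZONES)
        posterior[outcome_zone] += 1.0 - p_mistake                       # no mistake: intent == outcome
        posterior += p_mistake * prior_intent_given_miss[outcome_zone]   # mistake: fall back on the prior
        return posterior / posterior.sum()

    # Purely illustrative prior: a mistaken throw was aimed at any of the other 8 zones with equal probability.
    prior = np.full((N_ZONES, N_ZONES), 1.0 / (N_ZONES - 1))
    np.fill_diagonal(prior, 0.0)
    print(infer_intent(outcome_zone=4, p_mistake=0.3, prior_intent_given_miss=prior))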

Keywords: Computer vision; Human robot interaction; Intent recognition; Surface cues.


Conflict of interest statement

Conflict of interest: The authors declare that they have no affiliations with or involvement in any organization or entity with any conflict of interest regarding the subject matter or materials discussed in this manuscript.

Figures

Fig. 1
Throwing task setup with relevant dimensions
Fig. 2
3×3 target grid with dimensions and zone labels
Fig. 3
Dual camera setup consisting of the Intel D435 and a Pi camera fitted with a 50 mm lens. The Raspberry Pi 4 used to process images from the Pi Camera can be seen vertically mounted behind the cameras
Fig. 4
Probability of the intended target given the observed outcome target in the presence of a mistake, i.e., the subject misses the target they aimed at. ‘Missed’ refers to the subject missing the target grid entirely, i.e., not hitting any of the 9 zones (a plausible formalisation of this conditional probability is given after the figure list)
Fig. 5
The top-left graph shows the filtered X, Y values of the throwing wrist along with the composite score for a sample captured by the 0° D435 camera. The score was scaled for graphing, and the maximum value was observed at frame 77, denoting the throw frame. The bottom left shows the raw, interpolated, and filtered throwing-wrist Y coordinate values, illustrating the effect of the preprocessing discussed in Sect. 4.1 (a generic sketch of this step follows the figure list). The right shows the throw frame from all 6 cameras
Fig. 6
The top shows the LSTM model used to classify 2D pose data into one of 9 outcome classes; ‘Batch’ refers to the variable data batch size used during training/inference. The bottom shows the multi-branch 1D CNN model used to detect congruence between outcome and intent using features from a pre-trained emotion model. A and B denote the two input branches, whereas C is the concatenated branch. The layer parameters are shown in Table 4 (an illustrative sketch of the pose classifier follows the figure list)
Fig. 7
Mean accuracy over the 5 folds for each target zone, ordered left to right showing grids for the 0°, 45°, and 90° D435 camera views
Fig. 8
Anonymized image showing the position of the ball in Li et al. (2020) for each of the 9 target zones when thrown by a single participant. Even with the naked eye, one can differentiate between the outcome targets, especially which column of the target grid the ball might strike. The image at the top left represents target zone 1, while the image at the bottom right shows target zone 9
Fig. 9
Position of the ball for each of the 9 target zones for a single participant in our dataset, showing the difficulty of determining the outcome target from the ball position in the frame. The image at the top left represents target zone 1, while the image at the bottom right shows target zone 9
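
Reading Fig. 4 as a conditional distribution, one plausible empirical estimate from a labelled set of throws (notation ours, not taken from the paper) is

    P(\text{intent} = j \mid \text{outcome} = i, \text{mistake}) \approx \frac{n_{ij}}{\sum_{k \neq i} n_{ik}}, \qquad j \neq i,

where n_{ij} is the number of throws that were aimed at zone j but landed in zone i (with an extra outcome row for throws that missed the grid entirely).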
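
The Fig. 5 caption describes interpolating and filtering the throwing-wrist trajectory and taking the frame with the maximum composite score as the throw frame. The snippet below is only a generic sketch of that interpolate-filter-argmax pattern; the actual preprocessing is given in Sect. 4.1 of the paper, and both the smoothing filter and the composite score here are placeholders:

    import numpy as np
    from scipy.signal import savgol_filter

    def find_throw_frame(wrist_y_raw, composite_score_fn):
        """wrist_y_raw: per-frame wrist Y values, with NaN for dropped detections.
        composite_score_fn: stand-in for the paper's composite score."""
        y = np.asarray(wrist_y_raw, dtype=float)
        idx = np.arange(len(y))
        valid = ~np.isnan(y)
        y_interp = np.interp(idx, idx[valid], y[valid])                  # fill missing detections
        y_filt = savgol_filter(y_interp, window_length=11, polyorder=3)  # placeholder smoothing filter
        score = composite_score_fn(y_filt)
        return int(np.argmax(score))                                     # frame with the maximum score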
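
For Fig. 6 (top), the LSTM classifies a sequence of 2D pose keypoints into one of the 9 outcome zones; the actual layer parameters are listed in Table 4 of the paper. The PyTorch sketch below shows only the general shape, with placeholder sizes and an assumed 17-keypoint (x, y) input:

    import torch.nn as nn

    class PoseLSTM(nn.Module):
        """Illustrative pose-sequence classifier; sizes are placeholders, not the Table 4 values."""
        def __init__(self, n_coords=34, hidden=64, n_classes=9):  # 34 = 17 keypoints x (x, y), assumed
            super().__init__()
            self.lstm = nn.LSTM(n_coords, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):            # x: (batch, frames, n_coords), variable batch size
            _, (h, _) = self.lstm(x)     # h: (num_layers, batch, hidden)
            return self.head(h[-1])      # logits over the 9 outcome zones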

References

    1. Akilan T, Wu QJ, Safaei A, Huo J, Yang Y. A 3D CNN-LSTM-based image-to-image foreground segmentation. IEEE Transactions on Intelligent Transportation Systems. 2019;21(3):959–971. doi: 10.1109/TITS.2019.2900426.
    2. Alikhani M, Khalid B, Shome R, Mitash C, Bekris KE, Stone M. That and there: Judging the intent of pointing actions with robotic arms. In: AAAI. 2020. p. 10343–10351.
    3. Arriaga O, Valdenegro-Toro M, Plöger P. Real-time convolutional neural networks for emotion and gender classification. arXiv preprint arXiv:1710.07557. 2017.
    4. Cheuk T. Can AI be racist? Color-evasiveness in the application of machine learning to science assessments. Science Education. 2021;105(5):825–836. doi: 10.1002/sce.21671.
    5. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. doi: 10.1186/s12864-019-6413-7.
