Sensors (Basel). 2023 Jan 3;23(1):520. doi: 10.3390/s23010520.

Joint-Based Action Progress Prediction

Davide Pucci et al. Sensors (Basel). 2023.

Abstract

Action understanding is a fundamental branch of computer vision, with applications ranging from surveillance to robotics. Most works deal with localizing and recognizing actions in time and space, without characterizing how they evolve. Recent works have addressed the prediction of action progress, an estimate of how far the action has advanced as it is performed. In this paper, we propose to predict action progress from a different modality than previous methods: body joints. Body joints carry precise information about human poses, which we believe offer a much more lightweight and effective way of characterizing actions and, therefore, their execution. Action progress can in fact be estimated by understanding how key poses follow one another during the development of an activity. We show how an action progress prediction model can exploit body joints, and we integrate it with modules providing keypoint and action information so that it can run directly on raw pixels. The proposed method is experimentally validated on the Penn Action dataset.

Keywords: action progress prediction; body joints; body pose.
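To make the idea concrete, the following is a minimal sketch, not the authors' implementation, of a joint-based progress regressor: a recurrent network reads per-frame 2D joint coordinates and outputs a progress value in [0, 1] for every frame. The joint count (13, as annotated in Penn Action), the layer sizes, the use of an LSTM, and the choice of PyTorch are assumptions made for illustration.

# Minimal sketch (assumed architecture, not the paper's exact model):
# an LSTM regresses per-frame action progress in [0, 1] from 2D joint coordinates.
import torch
import torch.nn as nn

class JointProgressRegressor(nn.Module):
    def __init__(self, num_joints: int = 13, hidden_size: int = 128):
        super().__init__()
        # Each frame is a flattened vector of (x, y) coordinates for every joint.
        self.lstm = nn.LSTM(input_size=num_joints * 2,
                            hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (batch, frames, num_joints * 2)
        features, _ = self.lstm(joints)
        # Sigmoid keeps the predicted progress in [0, 1] at every frame.
        return torch.sigmoid(self.head(features)).squeeze(-1)

# Example: a batch of 2 clips, 60 frames each, 13 joints with (x, y) coordinates.
model = JointProgressRegressor()
progress = model(torch.randn(2, 60, 13 * 2))   # shape (2, 60), values in [0, 1]

In the full system described in the paper, the joint coordinates would be produced by a keypoint-extraction module rather than taken from ground-truth annotations, so that the pipeline can run directly on raw pixels.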


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. Proposed architecture for localization, action classification, and progress estimation. Blocks in purple are the outputs produced.

Figure 2. Action Progress Prediction architecture.

Figure 3. Output of UniPose when an example image is processed. Below it, from left to right, are the heatmaps produced for the head, the left elbow, and the right elbow.

Figure 4. Example of an annotated frame from the Penn Action dataset.

Figure 5. (Left) Number of videos per class in the train and test sets; (right) length distribution of Penn Action videos, in frames.

Figure 6. Proposed architecture for classification, based on an InceptionV3 backbone.

Figure 7. Per-class mean absolute errors when using ground-truth joints and ground-truth class labels (blue), when using extracted joints and ground-truth class labels (steel blue), and when everything is extracted by the architecture (light blue).

Figure 8. Examples of the outputs produced by the overall architecture. The first four videos are taken from the Penn Action dataset, whereas the last two are taken from YouTube at https://www.youtube.com/watch?v=UgKaDSA3uIg (accessed on 29 December 2022) and https://www.youtube.com/watch?v=IODxDxX7oi4 (accessed on 29 December 2022).

