Review

Micromachines (Basel). 2021 Dec 31;13(1):72. doi: 10.3390/mi13010072.

Visual Feature Learning on Video Object and Human Action Detection: A Systematic Review

Dengshan Li et al.

Abstract

Video object and human action detection are applied in many fields, such as video surveillance, face recognition, etc. Video object detection includes object classification and object location within the frame. Human action recognition is the detection of human actions. Video detection is usually more challenging than image detection, since video frames are often blurrier than still images. Moreover, video detection often suffers from other difficulties, such as video defocus, motion blur, part occlusion, etc. Nowadays, video detection technology can achieve real-time detection or highly accurate detection of blurry video frames. In this paper, various video object and human action detection approaches are reviewed and discussed, many of which have achieved state-of-the-art results. We mainly review and discuss the classic video detection methods based on supervised learning. In addition, the frequently used video object detection and human action recognition datasets are reviewed. Finally, a summary of video detection is presented: the video object and human action detection methods can be classified into frame-by-frame (frame-based) detection, key-frame-extraction detection, and temporal-information detection; the main methods that utilize the temporal information of adjacent video frames are the optical flow method, Long Short-Term Memory, and convolution across adjacent frames.

Keywords: LSTM; deep learning; human action recognition; optical flow; temporal information; video dataset; video object detection.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
An overview of video object and human action detection. The paper is organized according to this structure as well.
Figure 2
The training and test accuracy of a CNN-LSTM model, which was trained over nearly 3 days on the UCF101 dataset. Accuracy is an important metric on UCF101: the training accuracy is measured on the training data, and the test accuracy on the test data. The accuracy is computed frame by frame using image detection metrics.
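As a rough illustration of the CNN-LSTM pipeline evaluated in Figure 2, a minimal PyTorch sketch is given below. The ResNet-18 backbone, the hidden size, and the 101-class head for UCF101 are illustrative assumptions, not the exact configuration of the reviewed experiment.

    # Minimal CNN-LSTM sketch for frame-sequence action recognition (illustrative only).
    import torch
    import torch.nn as nn
    from torchvision import models

    class CNNLSTM(nn.Module):
        def __init__(self, num_classes=101, hidden_size=512):
            super().__init__()
            backbone = models.resnet18(weights=None)   # per-frame feature extractor (assumed choice)
            backbone.fc = nn.Identity()                # keep the 512-d pooled frame features
            self.backbone = backbone
            self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
            self.classifier = nn.Linear(hidden_size, num_classes)

        def forward(self, clips):                      # clips: (batch, time, 3, H, W)
            b, t, c, h, w = clips.shape
            feats = self.backbone(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
            _, (h_n, _) = self.lstm(feats)             # temporal aggregation over the frame features
            return self.classifier(h_n[-1])            # action-class logits for each clip

    logits = CNNLSTM()(torch.randn(2, 16, 3, 112, 112))  # 2 clips of 16 frames each
    print(logits.shape)                                   # torch.Size([2, 101])

Training and test accuracy as in Figure 2 would then be obtained by evaluating such a model clip by clip on the UCF101 training and test splits.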
Figure 3
The network structure of YOLO: the purple blocks are convolutional layers, and the purple-red blocks are max pooling layers. The structure has 26 convolutional layers and 4 max pooling layers. The front layers extract the features of the objects, while the rear layers classify the objects. The design of the structure follows some fixed routines. The planes reflect the size of the feature maps; the dimension of the network is not shown in the figure. The video frame shown in the figure is from the YouTube Objects (YTO) dataset [19].
Figure 4
The network structure of YOLOv3. YOLOv3 uses the idea of Feature Pyramid Networks: small feature maps are used to detect large objects, and large feature maps are used to detect small objects. YOLOv3 concatenates the output feature maps of sizes 32 × 32, 16 × 16, and 8 × 8 for detection.
Figure 5
The structure of YOLOv4, which has 3 parts: backbone, neck, and head. The backbone extracts the features of the object, the neck transmits the features, and the head is a detector that classifies the object and indicates its location in the image (frame). Each part is constructed from convolutional layers and pooling layers.
Figure 6
The structure of the dilated-convolution UAV detector, which has 5 Inception modules.
Figure 7
The workflow of FastUAV-NET.
Figure 8
The structure of DSSD. The orange boxes are convolutional layers, the light-orange boxes are pooling layers, and the lilac boxes are de-convolutional layers; the lilac gates represent the concatenation of the convolutional and de-convolutional features. The main pipeline contains 10 convolutional layers, 1 pooling layer, and 5 de-convolutional layers. The planes reflect the size of the feature maps, and the thickness reflects the dimension of the feature maps. The six branches at the top right of the figure form the prediction module, i.e., the classification and object localization module.
Figure 9
The overall structure of FSSD. The arrows show the information flow. The "Detection" box is the detector, which outputs the classes and locations of the objects in the image or frame.
Figure 10
The general workflow of two-stage video object detection. The framework is based on two-stage image object detection, shown as the "Image detection" module in the figure.
Figure 11
The structure of Minimum Delay video object detection. The single-frame detector is a one-stage detector, while the rest is a two-stage detector (including a feature extractor and a classifier), so we regard the whole as a mixed-stage object detector. The structure has two shortcut connections and a feedback connection, which improve the detection.
Figure 12
The structure of Convolutional Regression Tracking, which is located between the 2 convolutional neural network (CNN) pipelines. The structure can improve the mAP of the image object detector, which can then be used as a video object detector.
Figure 13
The architecture of Association LSTM. SSD is a one-stage detector, described earlier. FC denotes the fully connected layers. The Association Error generates the object classification, and the Regression Error generates the object localization.
Figure 14
The workflow of TD-Graph LSTM.
Figure 15
The structure of Two-Path convLSTM Pyramid. The detection result of the previous frame is aggregated into the detection process of the next frame.
Figure 16
The structure of STMN. The Spatial-Temporal Memory Module (STMM) extracts and transmits the spatial-temporal features. STMN does not use a fully connected layer after Position-Sensitive RoI Pooling.
Figure 17
The flow chart of T-CNN.
Figure 18
The illustration of DFF. Net_feature is the feature extractor (backbone), Net_task is the detector, and the flow function is the optical flow network [137]. The video frames are from the YTO dataset.
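A minimal sketch of the key idea behind DFF-style propagation is shown below: features are computed only on a key frame and warped to a non-key frame with the estimated optical flow by bilinear sampling. The tensor shapes and the use of grid_sample are illustrative assumptions; the optical flow network itself [137] is not reproduced here.

    # Illustrative flow-guided feature warping (DFF-style propagation), not the authors' exact code.
    import torch
    import torch.nn.functional as F

    def warp_features(key_feat, flow):
        # key_feat: (B, C, H, W) features of the key frame.
        # flow: (B, 2, H, W) per-pixel displacement pointing from the current frame into the key frame.
        b, _, h, w = key_feat.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
        coords = base + flow                                       # sampling locations in the key frame
        coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0              # normalize to [-1, 1] for grid_sample
        coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = torch.stack((coords_x, coords_y), dim=-1)           # (B, H, W, 2)
        return F.grid_sample(key_feat, grid, align_corners=True)

    feat = torch.randn(1, 256, 32, 32)
    flow = torch.zeros(1, 2, 32, 32)                               # zero flow: warped features equal the originals
    print(torch.allclose(warp_features(feat, flow), feat, atol=1e-5))  # True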
Figure 19
The structure of FGFA. Optical flow is used as a feature transmission method.
Figure 20
The structure of Long Short-Term Feature Aggregation (LSFA). The Flow Net implements the optical flow method; the large feature extraction network extracts the complex features from the key frames, and the tiny feature extraction network extracts the simple features from the non-key frames.
Figure 21
Three-dimensional convolution. Every 3 adjacent feature maps are convolved into the next feature map, and the operation proceeds in this manner.
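A minimal sketch of the three-dimensional convolution in Figure 21, assuming a PyTorch nn.Conv3d with a temporal kernel size of 3 so that every 3 adjacent frames contribute to each output feature map; the channel counts and padding are illustrative choices.

    # Illustrative 3-D convolution over adjacent frames (temporal kernel size 3).
    import torch
    import torch.nn as nn

    conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                       kernel_size=(3, 3, 3),      # (time, height, width): 3 adjacent frames per output map
                       padding=(1, 1, 1))          # keep the temporal and spatial sizes unchanged
    clip = torch.randn(1, 3, 16, 112, 112)         # (batch, channels, frames, H, W)
    print(conv3d(clip).shape)                      # torch.Size([1, 64, 16, 112, 112])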
Figure 22
The structure of TCN. Every 3 adjacent feature maps are convolved into the next feature map. The blue connection is a residual skip connection.
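A minimal sketch of a TCN-style temporal block matching the description in Figure 22: a 1-D convolution with kernel size 3 over the frame dimension plus a residual skip connection. The channel width is an illustrative assumption.

    # Illustrative TCN-style block: kernel-size-3 temporal convolution with a residual skip connection.
    import torch
    import torch.nn as nn

    class TemporalBlock(nn.Module):
        def __init__(self, channels=256):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # 3 adjacent steps per output
            self.relu = nn.ReLU()

        def forward(self, x):                        # x: (batch, channels, time)
            return self.relu(self.conv(x) + x)       # residual skip connection (the blue connection)

    feats = torch.randn(4, 256, 16)                  # per-frame feature vectors over 16 frames
    print(TemporalBlock()(feats).shape)              # torch.Size([4, 256, 16])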
Figure 23
The illustration of Detect to Track and Track to Detect. The two CNN pipelines are correlated for RoI tracking, for the purpose of enhancing the video object detection.
Figure 24
The structure of RRM. ⊖ represents the feature map subtraction operation, which can highlight the differences among adjacent frames. ⊕ represents the addition operation, which can highlight the similarities among adjacent frames.
Figure 25
The workflow of the proposed video detection system.
Figure 26
The illustration of MEGA. The arrows denote the directions of the aggregation, and the depth of the color indicates the sequence order.
Figure 27
The illustration of TSM. TSM shifts the feature map tensor along the temporal sequence, forward or backward. The empty positions are filled with zeros.
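A minimal sketch of the temporal shift described in Figure 27: part of the channels is shifted one step forward in time and part one step backward, with the emptied positions filled with zeros. The 1/8 + 1/8 channel split is an assumption taken from common TSM configurations, not necessarily the setting of the reviewed work.

    # Illustrative temporal shift (TSM-style): shift some channels along the time axis, pad with zeros.
    import torch

    def temporal_shift(x, fold_div=8):
        # x: (batch, time, channels, H, W)
        fold = x.size(2) // fold_div
        out = torch.zeros_like(x)
        out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift the first fold of channels forward in time
        out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift the next fold backward in time
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place
        return out

    x = torch.randn(2, 8, 64, 14, 14)                          # 8 frames of 64-channel feature maps
    print(temporal_shift(x).shape)                             # torch.Size([2, 8, 64, 14, 14])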
Figure 28
The workflow of High Quality Object Linking.
Figure 29
The flow chart of STINet.
Figure 30
The pipeline of PEN. The final detection is formed by concatenating the outputs of the Pedestrian Recognition Network and the preceding Region Proposal Network.
Figure 31
The workflow of Short-Term Anchor Linking and Long-Term Self-Guided Attention.

References

    1. Dalal N., Triggs B. Histograms of oriented gradients for human detection; Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05); San Diego, CA, USA. 20–25 June 2005; pp. 886–893.
    2. Lowe D.G. Object recognition from local scale-invariant features; Proceedings of the Seventh IEEE International Conference on Computer Vision; Kerkyra, Greece. 20–27 September 1999; pp. 1150–1157.
    3. Viola P., Jones M. Rapid object detection using a boosted cascade of simple features; Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001); Kauai, HI, USA. 8–14 December 2001.
    4. Haar A. Zur Theorie der orthogonalen Funktionensysteme. Math. Ann. 1910;69:331–371. doi: 10.1007/BF01456326.
    5. Farid H. Blind inverse gamma correction. IEEE Trans. Image Process. 2001;10:1428–1433. doi: 10.1109/83.951529.
