Review

Micromachines (Basel). 2021 Dec 31;13(1):72. doi: 10.3390/mi13010072.

Visual Feature Learning on Video Object and Human Action Detection: A Systematic Review

Dengshan Li et al.

Abstract

Video object and human action detection are applied in many fields, such as video surveillance, face recognition, etc. Video object detection includes object classification and object location within the frame. Human action recognition is the detection of human actions. Video detection is usually more challenging than image detection, since video frames are often blurrier than still images. Moreover, video detection often suffers from other difficulties, such as video defocus, motion blur, part occlusion, etc. Nowadays, video detection technology can achieve real-time detection or highly accurate detection of blurry video frames. In this paper, various video object and human action detection approaches are reviewed and discussed, many of which have achieved state-of-the-art results. We mainly review and discuss the classic video detection methods based on supervised learning. In addition, the frequently used video object detection and human action recognition datasets are reviewed. Finally, a summary of video detection is presented: the video object and human action detection methods can be classified into frame-by-frame (frame-based) detection, key-frame-extraction detection, and temporal-information detection; the main methods that utilize the temporal information of adjacent video frames are the optical flow method, Long Short-Term Memory, and convolution across adjacent frames.

Keywords: LSTM; deep learning; human action recognition; optical flow; temporal information; video dataset; video object detection.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
An overview of video object and human action detection. The paper is organized according to this structure as well.
Figure 2
The training and test accuracy of a CNN-LSTM model, which was trained over nearly 3 days on the UCF101 dataset. Accuracy is an important metric on UCF101: the training accuracy is measured on the training data, and the test accuracy on the test data. The accuracy is computed frame by frame using image detection metrics.
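As a rough illustration of the CNN-LSTM pipeline evaluated in Figure 2, a minimal PyTorch sketch is given below. The ResNet-18 backbone, the hidden size, and the 101-class head for UCF101 are illustrative assumptions, not the exact configuration of the reviewed experiment.

    # Minimal CNN-LSTM sketch for frame-sequence action recognition (illustrative only).
    import torch
    import torch.nn as nn
    from torchvision import models

    class CNNLSTM(nn.Module):
        def __init__(self, num_classes=101, hidden_size=512):
            super().__init__()
            backbone = models.resnet18(weights=None)   # per-frame feature extractor (assumed choice)
            backbone.fc = nn.Identity()                # keep the 512-d pooled frame features
            self.backbone = backbone
            self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
            self.classifier = nn.Linear(hidden_size, num_classes)

        def forward(self, clips):                      # clips: (batch, time, 3, H, W)
            b, t, c, h, w = clips.shape
            feats = self.backbone(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
            _, (h_n, _) = self.lstm(feats)             # temporal aggregation over the frame features
            return self.classifier(h_n[-1])            # action-class logits for each clip

    logits = CNNLSTM()(torch.randn(2, 16, 3, 112, 112))  # 2 clips of 16 frames each
    print(logits.shape)                                   # torch.Size([2, 101])

Training and test accuracy as in Figure 2 would then be obtained by evaluating such a model clip by clip on the UCF101 training and test splits.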
Figure 3
The network structure of YOLO: the purple blocks are convolutional layers, and the purple-red blocks are max pooling layers. The structure has 26 convolutional layers and 4 max pooling layers. The front layers extract the features of the objects, while the rear layers classify the objects. The design of the structure follows some fixed routines. The planes reflect the size of the feature maps; the dimension of the network is not shown in the figure. The video frame shown in the figure is from the YouTube Objects (YTO) dataset [19].
Figure 4
The network structure of YOLOv3. YOLOv3 uses the idea of Feature Pyramid Networks: small feature maps are used to detect large objects, and large feature maps are used to detect small objects. YOLOv3 concatenates the output feature maps of sizes 32 × 32, 16 × 16, and 8 × 8 for detection.
Figure 5
The structure of YOLOv4, which has 3 parts: backbone, neck, and head. The backbone extracts the features of the object, the neck transmits the features, and the head is a detector that classifies the object and indicates its location in the image (frame). Each part is constructed from convolutional layers and pooling layers.
Figure 6
The structure of the dilated-convolution UAV detector, which has 5 Inception modules.
Figure 7
The workflow of FastUAV-NET.
Figure 8
The structure of DSSD. The orange boxes are convolutional layers, the light-orange boxes are pooling layers, and the lilac boxes are de-convolutional layers; the lilac gates represent the concatenation of the convolutional and de-convolutional features. The main pipeline contains 10 convolutional layers, 1 pooling layer, and 5 de-convolutional layers. The planes reflect the size of the feature maps, and the thickness reflects the dimension of the feature maps. The six branches at the top right of the figure form the prediction module, i.e., the classification and object localization module.
Figure 9
The overall structure of FSSD. The arrows show the information flow. The "Detection" box is the detector, which outputs the classes and locations of the objects in the image or frame.
Figure 10
The general workflow of two-stage video object detection. The framework is based on two-stage image object detection, shown as the "Image detection" module in the figure.
Figure 11
The structure of Minimum Delay video object detection. The single-frame detector is a one-stage detector, while the rest is a two-stage detector (including a feature extractor and a classifier), so we regard the whole as a mixed-stage object detector. The structure has two shortcut connections and a feedback connection, which improve the detection.
Figure 12
The structure of Convolutional Regression Tracking, which is located between the 2 convolutional neural network (CNN) pipelines. The structure can improve the mAP of the image object detector, which can then be used as a video object detector.
Figure 13
The architecture of Association LSTM. SSD is a one-stage detector, described earlier. FC denotes the fully connected layers. The Association Error generates the object classification, and the Regression Error generates the object localization.
Figure 14
The workflow of TD-Graph LSTM.
Figure 15
The structure of Two-Path convLSTM Pyramid. The detection result of the previous frame is aggregated into the detection process of the next frame.
Figure 16
The structure of STMN. The Spatial-Temporal Memory Module (STMM) extracts and transmits the spatial-temporal features. STMN does not use a fully connected layer after Position-Sensitive RoI Pooling.
Figure 17
The flow chart of T-CNN.
Figure 18
The illustration of DFF. Net_feature is the feature extractor (backbone), Net_task is the detector, and the flow function is the optical flow network [137]. The video frames are from the YTO dataset.
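A minimal sketch of the key idea behind DFF-style propagation is shown below: features are computed only on a key frame and warped to a non-key frame with the estimated optical flow by bilinear sampling. The tensor shapes and the use of grid_sample are illustrative assumptions; the optical flow network itself [137] is not reproduced here.

    # Illustrative flow-guided feature warping (DFF-style propagation), not the authors' exact code.
    import torch
    import torch.nn.functional as F

    def warp_features(key_feat, flow):
        # key_feat: (B, C, H, W) features of the key frame.
        # flow: (B, 2, H, W) per-pixel displacement pointing from the current frame into the key frame.
        b, _, h, w = key_feat.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
        coords = base + flow                                       # sampling locations in the key frame
        coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0              # normalize to [-1, 1] for grid_sample
        coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = torch.stack((coords_x, coords_y), dim=-1)           # (B, H, W, 2)
        return F.grid_sample(key_feat, grid, align_corners=True)

    feat = torch.randn(1, 256, 32, 32)
    flow = torch.zeros(1, 2, 32, 32)                               # zero flow: warped features equal the originals
    print(torch.allclose(warp_features(feat, flow), feat, atol=1e-5))  # True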
Figure 19
The structure of FGFA. Optical flow is used as a feature transmission method.
Figure 20
The structure of Long Short-Term Feature Aggregation (LSFA). The Flow Net implements the optical flow method; the large feature extraction network extracts the complex features from the key frames, and the tiny feature extraction network extracts the simple features from the non-key frames.
Figure 21
Three-dimensional convolution. Every 3 adjacent feature maps are convolved into the next feature map, and the operation proceeds in this manner.
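A minimal sketch of the three-dimensional convolution in Figure 21, assuming a PyTorch nn.Conv3d with a temporal kernel size of 3 so that every 3 adjacent frames contribute to each output feature map; the channel counts and padding are illustrative choices.

    # Illustrative 3-D convolution over adjacent frames (temporal kernel size 3).
    import torch
    import torch.nn as nn

    conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                       kernel_size=(3, 3, 3),      # (time, height, width): 3 adjacent frames per output map
                       padding=(1, 1, 1))          # keep the temporal and spatial sizes unchanged
    clip = torch.randn(1, 3, 16, 112, 112)         # (batch, channels, frames, H, W)
    print(conv3d(clip).shape)                      # torch.Size([1, 64, 16, 112, 112])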
Figure 22
The structure of TCN. Every 3 adjacent feature maps are convolved into the next feature map. The blue connection is a residual skip connection.
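A minimal sketch of a TCN-style temporal block matching the description in Figure 22: a 1-D convolution with kernel size 3 over the frame dimension plus a residual skip connection. The channel width is an illustrative assumption.

    # Illustrative TCN-style block: kernel-size-3 temporal convolution with a residual skip connection.
    import torch
    import torch.nn as nn

    class TemporalBlock(nn.Module):
        def __init__(self, channels=256):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # 3 adjacent steps per output
            self.relu = nn.ReLU()

        def forward(self, x):                        # x: (batch, channels, time)
            return self.relu(self.conv(x) + x)       # residual skip connection (the blue connection)

    feats = torch.randn(4, 256, 16)                  # per-frame feature vectors over 16 frames
    print(TemporalBlock()(feats).shape)              # torch.Size([4, 256, 16])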
Figure 23
The illustration of Detect to Track and Track to Detect. The two CNN pipelines are correlated for RoI tracking, for the purpose of enhancing the video object detection.
Figure 24
The structure of RRM. ⊖ represents the feature map subtraction operation, which can highlight the differences among adjacent frames. ⊕ represents the addition operation, which can highlight the similarities among adjacent frames.
Figure 25
The workflow of the proposed video detection system.
Figure 26
The illustration of MEGA. The arrows denote the directions of the aggregation, and the depth of the color indicates the sequence order.
Figure 27
The illustration of TSM. TSM shifts the feature map tensor along the temporal sequence, forward or backward. The empty positions are filled with zeros.
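A minimal sketch of the temporal shift described in Figure 27: part of the channels is shifted one step forward in time and part one step backward, with the emptied positions filled with zeros. The 1/8 + 1/8 channel split is an assumption taken from common TSM configurations, not necessarily the setting of the reviewed work.

    # Illustrative temporal shift (TSM-style): shift some channels along the time axis, pad with zeros.
    import torch

    def temporal_shift(x, fold_div=8):
        # x: (batch, time, channels, H, W)
        fold = x.size(2) // fold_div
        out = torch.zeros_like(x)
        out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift the first fold of channels forward in time
        out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift the next fold backward in time
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place
        return out

    x = torch.randn(2, 8, 64, 14, 14)                          # 8 frames of 64-channel feature maps
    print(temporal_shift(x).shape)                             # torch.Size([2, 8, 64, 14, 14])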
Figure 28
The workflow of High Quality Object Linking.
Figure 29
The flow chart of STINet.
Figure 30
The pipeline of PEN. The final detection is formed by concatenating the outputs of the Pedestrian Recognition Network and the preceding Region Proposal Network.
Figure 31
The workflow of Short-Term Anchor Linking and Long-Term Self-Guided Attention.

References

    1. Dalal N., Triggs B. Histograms of oriented gradients for human detection; Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05); San Diego, CA, USA. 20–25 June 2005; pp. 886–893.
    2. Lowe D.G. Object recognition from local scale-invariant features; Proceedings of the Seventh IEEE International Conference on Computer Vision; Kerkyra, Greece. 20–27 September 1999; pp. 1150–1157.
    3. Viola P., Jones M. Rapid object detection using a boosted cascade of simple features; Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001); Kauai, HI, USA. 8–14 December 2001.
    4. Haar A. Zur Theorie der orthogonalen Funktionensysteme. Math. Ann. 1910;69:331–371. doi: 10.1007/BF01456326.
    5. Farid H. Blind inverse gamma correction. IEEE Trans. Image Process. 2001;10:1428–1433. doi: 10.1109/83.951529.
