Camera Motion Agnostic Method for Estimating 3D Human Poses

Seong Hyun Kim et al. Sensors (Basel). 2022 Oct 19;22(20):7975. doi: 10.3390/s22207975.

Abstract

Although the performance of 3D human pose and shape estimation methods has improved considerably in recent years, existing approaches typically generate 3D poses defined in a camera- or human-centered coordinate system. This makes it difficult to estimate a person's pure pose and motion in a world coordinate system from a video captured with a moving camera. To address this issue, this paper presents a camera motion agnostic approach for predicting 3D human pose and mesh defined in the world coordinate system. The core idea of the proposed approach is to estimate the difference between two adjacent global poses (i.e., the global motion), which is invariant to the choice of coordinate system, instead of the global pose coupled to the camera motion. To this end, we propose a network based on bidirectional gated recurrent units (GRUs), called the global motion regressor (GMR), that predicts the global motion sequence from the local pose sequence consisting of relative rotations of joints. We use the 3DPW and synthetic datasets, which are constructed in a moving-camera environment, for evaluation. We conduct extensive experiments and empirically demonstrate the effectiveness of the proposed method.
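The invariance claim at the heart of the abstract can be checked numerically. The sketch below (a minimal numpy illustration, not the authors' code; all function names are our own) represents each global pose as a rotation matrix and a translation, applies an arbitrary rigid change of world coordinate system to two adjacent poses, and verifies that the relative motion between them is unchanged:

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def relative_motion(R1, t1, R2, t2):
    """Motion from pose 1 to pose 2, expressed in pose 1's frame:
    R_rel = R1^T R2,  t_rel = R1^T (t2 - t1)."""
    return R1.T @ R2, R1.T @ (t2 - t1)

# Two adjacent global poses (rotation + translation).
R1, t1 = rot_z(0.1), np.array([0.0, 0.0, 0.0])
R2, t2 = rot_z(0.3), np.array([1.0, 0.5, 0.0])

# An arbitrary change of world coordinate system (rigid transform W).
Rw, tw = rot_z(1.2), np.array([5.0, -2.0, 3.0])
R1w, t1w = Rw @ R1, Rw @ t1 + tw
R2w, t2w = Rw @ R2, Rw @ t2 + tw

# The relative motion is identical in both coordinate systems.
Ra, ta = relative_motion(R1, t1, R2, t2)
Rb, tb = relative_motion(R1w, t1w, R2w, t2w)
assert np.allclose(Ra, Rb) and np.allclose(ta, tb)
```

This is why regressing global motion, rather than global pose, decouples the prediction from the camera trajectory: the quantity being regressed does not change when the world frame does.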

Keywords: 3D human pose estimation; 3D human shape reconstruction; statistical shape model.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Given a runner video (first row), the proposed framework correctly reconstructs the 3D running path (second and third rows), while VIBE-CAM, a combination of state-of-the-art human pose estimation methods [11,16,17], fails to reconstruct the 3D global pose of the runner (fourth row). The global pose represents the orientation and location of the entire body. The visualized reference frame is defined to be aligned with the person in the first frame. VIBE-CAM is detailed in Section 4.5.
Figure 2
The top row shows the image sequence rendered using only the local pose, without the global pose. Here, the relative orientations between rigid body parts (i.e., the local pose) change, but the orientation and location of the entire body (i.e., the global pose) remain unchanged. The bottom row shows the rendering result when the global pose is also included. Note that the main purpose of this paper is to estimate the global pose sequence from the local pose sequence.
Figure 3
Overall framework of the proposed method. Given an input video, the existing 3D human pose estimation network outputs a local human pose sequence. The proposed global motion regressor generates a global motion sequence from the local pose sequence. In the inference stage, the global motion is accumulated into a global pose, and finally, the SMPL reconstructs a human mesh sequence with the global pose defined in the world coordinate system.
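The accumulation step described in the framework caption can be sketched as a simple chaining of rigid transforms. In this minimal numpy illustration (our own sketch, assuming each global motion is a rotation and translation expressed in the previous frame's coordinates; not the authors' implementation), per-frame global motions are composed into a global pose trajectory:

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def accumulate(R0, t0, motions):
    """Chain per-frame global motions (R_rel, t_rel), each expressed in the
    previous frame's coordinates, into global poses:
        R_{k+1} = R_k @ R_rel,   t_{k+1} = t_k + R_k @ t_rel
    """
    poses = [(R0, t0)]
    for R_rel, t_rel in motions:
        R, t = poses[-1]
        poses.append((R @ R_rel, t + R @ t_rel))
    return poses

# Two identical motion steps: turn 90 degrees and step one unit forward.
step = (rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0]))
poses = accumulate(np.eye(3), np.zeros(3), [step, step])

# After two steps the body has turned 180 degrees and sits at (1, 1, 0).
```

The resulting global pose sequence is what the SMPL model would then consume to place the reconstructed mesh in the world coordinate system.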
Figure 4
Architecture of the global motion regressor (GMR).
Figure 5
Curves of our loss and errors during training.
Figure 6
Vertex errors on training and test data acquired using different sampling rates. The numbers in the graph represent the vertex error over time.
Figure 7
Vertex error over time. The numbers in the graph represent the vertex error between the predicted human mesh and its ground truth in the world coordinate system.
Figure 8
Qualitative comparison on the Mannequin Challenge dataset. The proposed method produces static human poses, while VIBE-CAM reconstructs spurious global human poses induced by the camera movement in the input video. Note that the reference coordinate system of VIBE-CAM is aligned with that of the proposed method for easy comparison.
Figure 9
Qualitative results on the 3DPW dataset. The downtown_walkDownhill_00 sequence is used as input to VIBE-CAM and our method.

References

    1. Huang Y., Bogo F., Lassner C., Kanazawa A., Gehler P.V., Romero J., Akhter I., Black M.J. Towards accurate marker-less human shape and pose estimation over time; Proceedings of the International Conference on 3D Vision (3DV); Qingdao, China. 10–12 October 2017.
    2. Pavlakos G., Zhou X., Derpanis K.G., Daniilidis K. Coarse-to-fine volumetric prediction for single-image 3D human pose; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI, USA. 21–26 July 2017.
    3. Martinez J., Hossain R., Romero J., Little J.J. A simple yet effective baseline for 3d human pose estimation; Proceedings of the IEEE International Conference on Computer Vision (ICCV); Venice, Italy. 22–29 October 2017.
    4. Pavllo D., Feichtenhofer C., Grangier D., Auli M. 3D human pose estimation in video with temporal convolutions and semi-supervised training; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Long Beach, CA, USA. 15–20 June 2019.
    5. Kanazawa A., Black M.J., Jacobs D.W., Malik J. End-to-end recovery of human shape and pose; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Salt Lake City, UT, USA. 18–23 June 2018.