Oral presentation. Wen-Li Wei and Jen-Chun Lin, Institute of Information Science, Academia Sinica, Taiwan. {lilijinjin, jenchunlin}@gmail.com
Estimating 3D human pose and shape from monocular video is an ill-posed problem due to depth ambiguity. Yet, most existing methods overlook the multiple motion hypotheses that arise from this ambiguity. To tackle this, we propose a multi-candidate motion pose and shape network (MMPS-Net), which generates temporal representations of multiple plausible motion candidates and adaptively fuses them for 3D human pose and shape estimation. Specifically, we first propose a multi-candidate motion continuity attention (MMoCA) module to generate multiple kinematically compliant motion candidates. Second, we introduce a multi-candidate cross-attention (MCA) module that enables information passing among candidates to strengthen their relevance. Third, we develop a multi-candidate hierarchical attentive feature integration (MHAFI) module that refines the target frame's feature representation by capturing temporal correlations within each motion candidate and adaptively integrating all candidates. By coupling these designs, MMPS-Net surpasses existing video-based methods on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks.
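The abstract does not specify the internals of the MCA module; as a rough illustration of "information passing among candidates", the sketch below applies standard scaled dot-product attention across the candidate axis, so that each candidate's per-frame feature attends to the same frame of all other candidates. All shapes, names, and the shared-frame attention pattern are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def candidate_cross_attention(cands):
    """Hypothetical MCA-style mixing.

    cands: (M, T, D) array — M motion candidates, T frames, D feature dims.
    For each frame, every candidate attends over all candidates' features
    at that frame (scaled dot-product attention along the candidate axis).
    """
    M, T, D = cands.shape
    x = cands.transpose(1, 0, 2)                      # (T, M, D): frame-major
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(D)    # (T, M, M) similarities
    attn = softmax(scores, axis=-1)                   # rows sum to 1
    out = attn @ x                                    # (T, M, D) mixed features
    return out.transpose(1, 0, 2)                     # back to (M, T, D)

rng = np.random.default_rng(0)
cands = rng.standard_normal((3, 5, 8))   # 3 candidates, 5 frames, 8-dim features
fused = candidate_cross_attention(cands)
print(fused.shape)  # (3, 5, 8)
```

In a real model the queries, keys, and values would come from learned projections of each candidate's features; the sketch uses the raw features for brevity.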
Figure: Overview of our multi-candidate motion pose and shape network (MMPS-Net). MMPS-Net estimates the pose, shape, and camera parameters Θt for each frame t of the video sequence via a static feature extractor, a multi-candidate temporal encoder, a multi-candidate temporal communicator, a multi-candidate temporal integrator, and an SMPL parameter regressor to generate 3D human pose and shape.
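The final integration stage is described only at a high level; a minimal sketch of the two-level ("hierarchical") idea behind MHAFI-style fusion is shown below: temporal attention first refines the target frame within each candidate, then candidate-level attention weights combine the candidates. Every name, shape, and weighting scheme here is a hypothetical stand-in, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_fuse(cands, t):
    """Hypothetical MHAFI-style fusion.

    cands: (M, T, D) — M candidates, T frames, D feature dims.
    Level 1: each candidate's target-frame feature attends over that
             candidate's own frames (temporal refinement).
    Level 2: refined candidates are adaptively weighted and summed.
    Returns a single fused (D,) feature for frame t.
    """
    M, T, D = cands.shape
    q = cands[:, t, :]                                           # (M, D) queries
    w_t = softmax((q[:, None, :] * cands).sum(-1) / np.sqrt(D))  # (M, T) temporal weights
    refined = (w_t[:, :, None] * cands).sum(1)                   # (M, D) per-candidate summary
    mean_q = refined.mean(0)                                     # (D,) candidate-level query
    w_c = softmax(refined @ mean_q / np.sqrt(D))                 # (M,) candidate weights
    return (w_c[:, None] * refined).sum(0)                       # (D,) fused feature

rng = np.random.default_rng(1)
cands = rng.standard_normal((4, 7, 8))   # 4 candidates, 7 frames, 8-dim features
fused = hierarchical_fuse(cands, t=3)
print(fused.shape)  # (8,)
```

The fused feature would then feed the SMPL parameter regressor; a trained network would replace the dot-product heuristics with learned attention parameters.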