Oral presentation. Wen-Li Wei and Jen-Chun Lin, Institute of Information Science, Academia Sinica, Taiwan. {lilijinjin, jenchunlin}@gmail.com
Estimating 3D human pose and shape from monocular video is an ill-posed problem due to depth ambiguity. Yet, most existing methods overlook the multiple motion hypotheses that arise from this ambiguity. To tackle this, we propose a multi-candidate motion pose and shape network (MMPS-Net), which generates temporal representations of multiple plausible motion candidates and adaptively fuses them for 3D human pose and shape estimation. Specifically, we first propose a multi-candidate motion continuity attention (MMoCA) module to generate multiple kinematically compliant motion candidates. Second, we introduce a multi-candidate cross-attention (MCA) module that enables information passing among candidates to strengthen their relevance. Third, we develop a multi-candidate hierarchical attentive feature integration (MHAFI) module that refines the target frame's feature representation by capturing temporal correlations within each motion candidate and adaptively integrating all candidates. By coupling these designs, MMPS-Net surpasses existing video-based methods on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks.
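The abstract does not specify the internals of the MCA module; as a rough illustration of "information passing among candidates", the sketch below applies standard scaled dot-product attention across the candidate axis, so that each candidate's per-frame feature attends to the same frame of all other candidates. All shapes, names, and the shared-frame attention pattern are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def candidate_cross_attention(cands):
    """Hypothetical MCA-style mixing.

    cands: (M, T, D) array — M motion candidates, T frames, D feature dims.
    For each frame, every candidate attends over all candidates' features
    at that frame (scaled dot-product attention along the candidate axis).
    """
    M, T, D = cands.shape
    x = cands.transpose(1, 0, 2)                      # (T, M, D): frame-major
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(D)    # (T, M, M) similarities
    attn = softmax(scores, axis=-1)                   # rows sum to 1
    out = attn @ x                                    # (T, M, D) mixed features
    return out.transpose(1, 0, 2)                     # back to (M, T, D)

rng = np.random.default_rng(0)
cands = rng.standard_normal((3, 5, 8))   # 3 candidates, 5 frames, 8-dim features
fused = candidate_cross_attention(cands)
print(fused.shape)  # (3, 5, 8)
```

In a real model the queries, keys, and values would come from learned projections of each candidate's features; the sketch uses the raw features for brevity.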
Figure: Overview of our multi-candidate motion pose and shape network (MMPS-Net). MMPS-Net estimates the pose, shape, and camera parameters Θt for each frame t of the video sequence via a static feature extractor, a multi-candidate temporal encoder, a multi-candidate temporal communicator, a multi-candidate temporal integrator, and an SMPL parameter regressor to generate 3D human pose and shape.
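The final integration stage is described only at a high level; a minimal sketch of the two-level ("hierarchical") idea behind MHAFI-style fusion is shown below: temporal attention first refines the target frame within each candidate, then candidate-level attention weights combine the candidates. Every name, shape, and weighting scheme here is a hypothetical stand-in, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_fuse(cands, t):
    """Hypothetical MHAFI-style fusion.

    cands: (M, T, D) — M candidates, T frames, D feature dims.
    Level 1: each candidate's target-frame feature attends over that
             candidate's own frames (temporal refinement).
    Level 2: refined candidates are adaptively weighted and summed.
    Returns a single fused (D,) feature for frame t.
    """
    M, T, D = cands.shape
    q = cands[:, t, :]                                           # (M, D) queries
    w_t = softmax((q[:, None, :] * cands).sum(-1) / np.sqrt(D))  # (M, T) temporal weights
    refined = (w_t[:, :, None] * cands).sum(1)                   # (M, D) per-candidate summary
    mean_q = refined.mean(0)                                     # (D,) candidate-level query
    w_c = softmax(refined @ mean_q / np.sqrt(D))                 # (M,) candidate weights
    return (w_c[:, None] * refined).sum(0)                       # (D,) fused feature

rng = np.random.default_rng(1)
cands = rng.standard_normal((4, 7, 8))   # 4 candidates, 7 frames, 8-dim features
fused = hierarchical_fuse(cands, t=3)
print(fused.shape)  # (8,)
```

The fused feature would then feed the SMPL parameter regressor; a trained network would replace the dot-product heuristics with learned attention parameters.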