ExtPose: Robust and Coherent Pose Estimation by Extending ViTs

Authors: Rongyu Chen, Li’An Zhuo, Linlin Yang, Qi Wang, Liefeng Bo, Bang Zhang, Angela Yao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks with substantial improvements to 34.0mm (-23%) on 3DPW and 4.9mm (-18%) on FreiHAND in PA-MPJPE over the other ViT-based methods, respectively. Additionally, we thoroughly evaluate the effectiveness of key design components, including 2D pose representations, fusion strategies, and learning capabilities. The framework is effective on the human body and hands for both image- and video-based settings. It binds these relevant settings with the foundational ViT-based methods; thus, they can benefit from advances made in ViT-based methods.
Researcher Affiliation | Collaboration | 1Computer Vision & Machine Learning Group, National University of Singapore; 2Tongyi Lab, Alibaba Group; 3Communication University of China. Correspondence to: Bang Zhang <EMAIL>.
Pseudocode | No | The paper describes the methods in prose and mathematical formulations within the main text. There are no explicit blocks labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | https://gloryyrolg.github.io/extpose
Open Datasets | Yes | Settings. Following standard practice (Shin et al., 2024; Goel et al., 2023), EXTPOSE initialized from HMR2.0 is trained on mixed 3D datasets including 3DPW (Von Marcard et al., 2018), Human3.6M (Ionescu et al., 2013), MPI-INF-3DHP (Mehta et al., 2017), and COCO (Lin et al., 2014). For evaluation on the hand video and image benchmarks below, we follow and also train HaMeR (Pavlakos et al., 2024) on multiple 3D datasets, FreiHAND (Zimmermann et al., 2019), HO3D (Hampali et al., 2020), MTC (Xiang et al., 2019), RHD (Zimmermann & Brox, 2017), InterHand2.6M (Moon et al., 2020), H2O3D (Hampali et al., 2020), DexYCB (Chao et al., 2021), and 2D datasets, COCO-WholeBody (Jin et al., 2020), Halpe (Fang et al., 2022), and MPII NZSL (Simon et al., 2017).
Dataset Splits | No | Following standard practice (Shin et al., 2024; Goel et al., 2023), EXTPOSE initialized from HMR2.0 is trained on mixed 3D datasets including 3DPW (Von Marcard et al., 2018), Human3.6M (Ionescu et al., 2013), MPI-INF-3DHP (Mehta et al., 2017), and COCO (Lin et al., 2014). The paper refers to 'standard practice' for training but does not specify the exact train/test/validation splits or their percentages for the datasets used in their experiments.
Hardware Specification | Yes | Training lasts for 50K iterations with a batch size of 32 on 8 A100 GPUs. The per-frame computation time (running time) of core modules is measured with batch frames of 1024 on one A100 GPU.
Software Dependencies | No | We use the PyTorch implementation of scaled dot product attention with the mask and accelerated flash attention to speed computation and save GPU memory. An AdamW optimizer (Loshchilov & Hutter, 2019) is deployed with a learning rate of 1e-5, β1 = 0.9, β2 = 0.999, and a weight decay of 1e3. The paper mentions software such as PyTorch and the AdamW optimizer, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | During the training of all backbone parameters, besides standard affine and color data augmentation (Goel et al., 2023), each modality is masked out as a whole with a probability of 50% to cultivate EXTPOSE's ability to extract features in each input modality individually. We use the PyTorch implementation of scaled dot product attention with the mask and accelerated flash attention to speed computation and save GPU memory. An AdamW optimizer (Loshchilov & Hutter, 2019) is deployed with a learning rate of 1e-5, β1 = 0.9, β2 = 0.999, and a weight decay of 1e3. Training lasts for 50K iterations with a batch size of 32 on 8 A100 GPUs.
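The modality-masking augmentation quoted above (each input modality dropped as a whole with 50% probability) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the modality names ("image", "pose2d"), the `None` sentinel for a masked modality, and the `mask_modalities` helper are all assumptions made for the example.

```python
import random


def mask_modalities(sample, modalities=("image", "pose2d"), p=0.5, rng=random):
    """Mask out each input modality as a whole with probability p.

    `sample` maps modality names to feature values; a masked modality is
    replaced with None so the model must learn to rely on whatever inputs
    remain. Modality names and the None sentinel are illustrative only.
    """
    masked = dict(sample)
    for modality in modalities:
        if modality in masked and rng.random() < p:
            masked[modality] = None
    return masked
```

Note that with independent 50% masking both modalities can occasionally be dropped in the same sample; the paper's quoted description does not say whether such all-masked samples are kept or re-drawn.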