SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos
Authors: Yingying Jiao, Zhigang Wang, Sifan Wu, Shaojing Fan, Zhenguang Liu, Zhuoyue Xu, Zheqi Wu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carried out thorough evaluations for video pose propagation and video pose estimation tasks on three popular benchmarks: PoseTrack2017 (Iqbal, Milan, and Gall 2017), PoseTrack2018 (Andriluka et al. 2018), and PoseTrack21 (Doering et al. 2022). The videos in these datasets feature diverse challenges, such as crowded scenes and rapid movements. We evaluate our model using the standard pose estimation metric, average precision (AP), by initially calculating the AP for each joint and subsequently deriving the model's overall performance through the mean average precision (mAP) across all joints. The results of video pose propagation on PoseTrack2017 (Iqbal, Milan, and Gall 2017), PoseTrack2018 (Andriluka et al. 2018), and PoseTrack21 (Doering et al. 2022) datasets. We conduct a comprehensive evaluation of each component in our proposed STDPose framework, presenting the quantitative results in Table 4. |
| Researcher Affiliation | Academia | 1College of Computer Science and Technology, Jilin University 2Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University 3College of Computer Science and Technology, Zhejiang Gongshang University 4School of Computing, National University of Singapore 5The State Key Laboratory of Blockchain and Data Security, Zhejiang University 6Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed method in Section 3 and its subsections, outlining the components and their interactions in paragraph form. No explicitly labeled 'Pseudocode', 'Algorithm', or structured code blocks are present in the main text. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository in the main text or supplementary information. There are no phrases like "We release our code..." or links to GitHub/GitLab. |
| Open Datasets | Yes | We carried out thorough evaluations for video pose propagation and video pose estimation tasks on three popular benchmarks: PoseTrack2017 (Iqbal, Milan, and Gall 2017), PoseTrack2018 (Andriluka et al. 2018), and PoseTrack21 (Doering et al. 2022). The videos in these datasets feature diverse challenges, such as crowded scenes and rapid movements. We utilize a standard Vision Transformer (Dosovitskiy et al. 2020) pretrained on the COCO dataset (Lin et al. 2014) as the backbone network of our STDPose framework. |
| Dataset Splits | Yes | We carried out thorough evaluations for video pose propagation and video pose estimation tasks on three popular benchmarks: PoseTrack2017 (Iqbal, Milan, and Gall 2017), PoseTrack2018 (Andriluka et al. 2018), and PoseTrack21 (Doering et al. 2022). By varying parameter T, we control the proportion of manually-labeled frames, with T=2 indicating a 50/50 split. We then evaluate the pose estimation performance on the PoseTrack2017 validation set. As shown in Table 3, pseudo-labels generated from pose propagation significantly improve pose estimation when dealing with sparsely-labeled videos. Our model achieves 84.3 mAP at T=4, close to FAMI-Pose (Liu et al. 2022a). Notably, at T=2, our model excels over FAMI-Pose (Liu et al. 2022a), achieving 85.2 mAP with only 50% of the manually-labeled frames, demonstrating superior performance with only half the labeled data. We conduct a comprehensive evaluation of each component in our proposed STDPose framework, presenting the quantitative results in Table 4. All the ablation studies are conducted on the PoseTrack2017 validation set. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware used to run its experiments, such as GPU models, CPU models, or cloud computing specifications. It only mentions general experimental settings. |
| Software Dependencies | No | The paper mentions using a "Vision Transformer... pretrained on the COCO dataset as the backbone network" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed for replication. |
| Experiment Setup | Yes | The input image size is 256×192. We utilize a standard Vision Transformer (Dosovitskiy et al. 2020) pretrained on the COCO dataset (Lin et al. 2014) as the backbone network of our STDPose framework. We set the parameters α to 0.1 and β to 0.01 in Eq. 2, and have not densely tuned them. |
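The evaluation protocol quoted in the Research Type row (per-joint AP averaged into mAP) can be sketched as follows. This is a minimal illustration of the aggregation step only, not the paper's evaluation code; the joint names and AP values are hypothetical placeholders, and the underlying per-joint AP computation (e.g. PCKh-style matching used by the PoseTrack benchmarks) is not reproduced here.

```python
# Hypothetical per-joint AP scores; the paper reports real values in
# its result tables, which are not reproduced here.
PER_JOINT_AP = {
    "head": 0.90, "shoulder": 0.88, "elbow": 0.84,
    "wrist": 0.80, "hip": 0.86, "knee": 0.83, "ankle": 0.79,
}

def mean_average_precision(per_joint_ap: dict) -> float:
    """Aggregate per-joint AP into a single mAP by averaging over joints,
    as described in the paper's evaluation protocol."""
    return sum(per_joint_ap.values()) / len(per_joint_ap)

print(round(mean_average_precision(PER_JOINT_AP), 4))
```

With the placeholder values above, the average of the seven joint APs is printed as the overall mAP.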