SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training
Authors: Nie Lin, Takehiko Ohkawa, Yifei Huang, Mingfang Zhang, Minjie Cai, Ming Li, Ryosuke Furuta, Yoichi Sato
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) in various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands. Our experiments demonstrate that our approach surpasses prior pre-training methods and achieves robust performances across different hand pose datasets. |
| Researcher Affiliation | Academia | ¹The University of Tokyo, ²Hunan University. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methodological steps and includes mathematical formulas and diagrams (e.g., Fig. 3 for overview, Eq. 4 for loss function), but it does not contain a clearly labeled pseudocode or algorithm block with structured steps formatted like code. |
| Open Source Code | Yes | Our code is available at https://github.com/ut-vision/SiMHand. |
| Open Datasets | Yes | Specifically, we collected 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. ... We processed two datasets, Ego4D (Grauman et al., 2022) and 100DOH (Shan et al., 2020)... We conduct fine-tuning experiments on three datasets with 3D hand pose ground truth in various data size and viewpoints: exocentric datasets from FreiHand (Zimmermann et al., 2019) and DexYCB (Chao et al., 2021), and an egocentric dataset AssemblyHands (Ohkawa et al., 2023b). |
| Dataset Splits | Yes | FreiHand consists of 130.2K training frames and 3.9K test frames... DexYCB contains 325.3K training images and 98.2K test images... AssemblyHands, the largest of the three, includes 704.0K training samples and 109.8K test samples... Following (Spurr et al., 2021), we prepare 10% of the labeled FreiHand dataset, which is denoted as FreiHand*, especially used for ablation studies. This allows us to assess the performance in a limited supervision setting. ... Fig. 4 illustrates the experiment under different proportions of labeled fine-tuning data, namely 10%, 20%, 40%, and 80% in FreiHand. |
| Hardware Specification | Yes | We use 8 NVIDIA V100 GPUs with a batch size of 8192 for pre-training. ... We use a single NVIDIA V100 GPU with a batch size of 128. |
| Software Dependencies | No | The paper mentions using ResNet-50 as the encoder, LARS and ADAM optimizers, and MediaPipe for keypoint extraction, but does not provide specific version numbers for these or other software libraries. |
| Experiment Setup | Yes | For similar hands mining, we choose the PCA embedding size as D = 14. For the pre-training framework, we use ResNet-50 (He et al., 2016) as the encoder. Throughout the pre-training phase, all models are trained using LARS (You et al., 2017) with ADAM (Kingma & Ba, 2014) optimizer, with the learning rate of 3.2e-3. Following (Spurr et al., 2021), SimCLR employs scale and color jitter as image augmentation, while PeCLR and SiMHand utilize scale, rotation, translation, and color jitter. We use resized images with 128x128 as the input. We set the temperature parameter τ of contrastive learning as 0.5. We use 8 NVIDIA V100 GPUs with a batch size of 8192 for pre-training. ... We use a single NVIDIA V100 GPU with a batch size of 128. |
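The setup row above describes two reproducible pieces: mining similar hands in a D = 14 PCA embedding space, and contrastive pre-training with temperature τ = 0.5. The following is a minimal numpy sketch of those two steps, not the authors' released implementation; the function names, the Euclidean nearest-neighbour mining rule, and the SimCLR-style NT-Xent loss form are all assumptions made for illustration.

```python
import numpy as np

def pca_embed(keypoints, dim=14):
    """Project flattened hand keypoints onto their top `dim` principal
    components (the paper reports an embedding size of D = 14)."""
    X = keypoints.reshape(len(keypoints), -1)
    X = X - X.mean(axis=0)
    # SVD of the centered data matrix yields the principal directions.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:dim].T

def mine_similar_pairs(embeddings):
    """For each sample, return the index of its nearest neighbour
    (excluding itself) in the PCA space, as a candidate positive pair."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a sample must not pair with itself
    return d.argmin(axis=1)

def nt_xent_loss(z_a, z_b, tau=0.5):
    """Temperature-scaled contrastive (NT-Xent) loss over paired feature
    batches, with tau = 0.5 as in the reported setup."""
    z = np.concatenate([z_a, z_b])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity
    n = len(z_a)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()
```

In this sketch, the mined nearest neighbour supplies the positive pair that the paper obtains from similar hands rather than from augmentations of a single image; the encoder features of each mined pair would be fed to `nt_xent_loss` during pre-training.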