HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Authors: Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments for both the proposed tasks in order to answer the following research questions: How plausible are the hand trajectories produced by HandsOnVLM? Does HandsOnVLM exhibit reasoning abilities for implicit language queries? Does HandsOnVLM generalize zero-shot to unseen scenes from new datasets? |
| Researcher Affiliation | Academia | Chen Bao EMAIL Carnegie Mellon University Jiarui Xu EMAIL UC San Diego Xiaolong Wang EMAIL UC San Diego Abhinav Gupta EMAIL Carnegie Mellon University Homanga Bharadhwaj EMAIL Carnegie Mellon University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides diagrams of the training and inference pipelines in Figures 5 and 6, but these are not formatted as pseudocode. |
| Open Source Code | Yes | We develop benchmarks for evaluating progress on the VHP and RBHP tasks which we open-source to the community, in addition to our trained models on the respective benchmarks. More details can be found at https://www.chenbao.tech/handsonvlm/. |
| Open Datasets | Yes | We choose Epic-Kitchen (Damen et al., 2018; 2022), H2O (Kwon et al., 2021) and FPHA (Garcia-Hernando et al., 2018) as datasets for this task. ... We choose Epic-Kitchen and Ego4D (Grauman et al., 2022) as datasets for this task. |
| Dataset Splits | Yes | Table 1: Comparison of VHP task with different baselines. We report the performance on the validation split of the Epic-Kitchen dataset. For the RBHP baselines, we also evaluate them on two unseen datasets, H2O and FPHA. Table 9: Data statistics of the VHP and RBHP tasks — VHP: Epic-Kitchen-55 (8523 train / 1894 val), Epic-Kitchen-100 (24148 train / 3513 val), H2O (503 val), FPHA (501 val); RBHP: Epic-Kitchen-100 (4018 train / 3513 val), Ego4D (8673 val). |
| Hardware Specification | Yes | The total wall-clock time for training is around 18 hours for the 7B models while using 8 H100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and architectures like CLIP-L-14, Vicuna, LLaVA, and CVAE, but does not provide specific version numbers for the underlying software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For the VHP and RBHP datasets, we sample 10 frames and predict the hand position in the next 4 frames at FPS = 4. In addition to our proposed datasets, HandsOnVLM is also trained on a few additional datasets for five different tasks... We use a batch size of 128, a learning rate of 2e-5, and train for 40 epochs. The total wall-clock time for training is around 18 hours for the 7B models while using 8 H100 GPUs. The LLM and vision-language projector are initialized with the LLaVA-1.3 pre-trained weights. During training, we freeze the visual backbone and fully fine-tune other modules. |
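The setup row above describes sampling 10 context frames and predicting hand positions for the next 4 frames at an effective 4 FPS. The following is a minimal sketch of what that clip-sampling scheme could look like; the function name, the deterministic start index, and the raw-video parameters are illustrative assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch of the clip sampling described in the paper:
# 10 context frames at ~4 FPS, with hand positions predicted for the
# next 4 frames. Names and defaults are illustrative, not official.

def sample_clip(num_video_frames: int, video_fps: float,
                context_frames: int = 10, future_frames: int = 4,
                target_fps: float = 4.0):
    """Return (context_indices, future_indices) into the raw video."""
    # Subsample the raw video down to roughly target_fps.
    stride = max(1, round(video_fps / target_fps))
    total = context_frames + future_frames
    # Latest start index such that the whole window fits in the video.
    last_start = num_video_frames - stride * (total - 1) - 1
    if last_start < 0:
        raise ValueError("video too short for the requested window")
    start = 0  # deterministic here; a training loader would randomize
    indices = [start + i * stride for i in range(total)]
    return indices[:context_frames], indices[context_frames:]

# Example: a 4-second clip recorded at 30 FPS.
ctx, fut = sample_clip(num_video_frames=120, video_fps=30)
# stride = 8, so context covers indices 0..72 and the 4 future
# (prediction) frames are indices 80, 88, 96, 104.
```

The split into context and future indices mirrors the task structure: the model conditions on the context frames plus the language query, and is supervised on hand positions at the future indices.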