HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

Authors: Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments for both the proposed tasks in order to answer the following research questions: How plausible are the hand trajectories produced by HandsOnVLM? Does HandsOnVLM exhibit reasoning abilities for implicit language queries? Does HandsOnVLM generalize zero-shot to unseen scenes from new datasets? (Sections 5.1 Experiment Details, 5.2 Metrics and Baselines, 5.3 Comparisons with Baselines, 5.4 Ablation Study)
Researcher Affiliation | Academia | Chen Bao (EMAIL, Carnegie Mellon University); Jiarui Xu (EMAIL, UC San Diego); Xiaolong Wang (EMAIL, UC San Diego); Abhinav Gupta (EMAIL, Carnegie Mellon University); Homanga Bharadhwaj (EMAIL, Carnegie Mellon University)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides diagrams of the training and inference pipelines in Figures 5 and 6, but these are not formatted as pseudocode.
Open Source Code | Yes | We develop benchmarks for evaluating progress on the VHP and RBHP tasks which we open-source to the community, in addition to our trained models on the respective benchmarks. More details can be found at https://www.chenbao.tech/handsonvlm/.
Open Datasets | Yes | We choose Epic-Kitchen (Damen et al., 2018; 2022), H2O (Kwon et al., 2021) and FPHA (Garcia-Hernando et al., 2018) as datasets for this task. ... We choose Epic-Kitchen and Ego4D (Grauman et al., 2022) as datasets for this task.
Dataset Splits | Yes | Table 1: Comparison of VHP task with different baselines. We reported the performance on the validation split of Epic-Kitchen dataset. For the RBHP baselines, we also evaluate them on two unseen datasets, H2O and FPHA. Table 9 (Data Statistics of VHP and RBHP tasks):
Task | Dataset | Training Samples | Validation Samples
VHP | Epic-Kitchen-55 | 8523 | 1894
VHP | Epic-Kitchen-100 | 24148 | 3513
VHP | H2O | n/a | 503
VHP | FPHA | n/a | 501
RBHP | Epic-Kitchen-100 | 4018 | 3513
RBHP | Ego4D | n/a | 8673
Hardware Specification | Yes | The total wall-clock time for training is around 18 hours for the 7B models while using 8 H100 GPUs.
Software Dependencies | No | The paper mentions specific models and architectures like CLIP-L-14, Vicuna, LLaVA, and CVAE, but does not provide specific version numbers for the underlying software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | For VHP and RBHP datasets, we sample 10 frames and predict the hand position in the next 4 frames at FPS = 4. In addition to our proposed datasets, HandsOnVLM is also trained on a few additional datasets for five different tasks... We use a batch size of 128, a learning rate of 2e-5, and train for 40 epochs. The total wall-clock time for training is around 18 hours for the 7B models while using 8 H100 GPUs. The LLM and vision-language projector are initialized with the LLaVA-1.3 pre-trained weights. During training, we freeze the visual backbone and fully fine-tune other modules.
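The temporal setup quoted in the experiment-setup row (10 observed frames, 4 predicted frames, FPS = 4) can be sketched as follows. This is a minimal illustration, not the authors' code: only the numeric values come from the report, and all names (`TRAIN_CONFIG`, `sample_timestamps`) are hypothetical.

```python
# Training hyperparameters quoted in the report (7B model run).
TRAIN_CONFIG = {
    "batch_size": 128,
    "learning_rate": 2e-5,
    "epochs": 40,
}

# Temporal setup: 10 observed context frames, 4 predicted future frames,
# both sampled at 4 frames per second.
FPS = 4
NUM_CONTEXT_FRAMES = 10
NUM_FUTURE_FRAMES = 4

def sample_timestamps(t_now: float) -> tuple[list[float], list[float]]:
    """Return (context, future) timestamps in seconds around time t_now."""
    dt = 1.0 / FPS  # 0.25 s between consecutive frames at 4 FPS
    context = [t_now - (NUM_CONTEXT_FRAMES - 1 - i) * dt
               for i in range(NUM_CONTEXT_FRAMES)]
    future = [t_now + (i + 1) * dt for i in range(NUM_FUTURE_FRAMES)]
    return context, future

ctx, fut = sample_timestamps(10.0)
print(len(ctx), len(fut))  # 10 4
print(fut)                 # [10.25, 10.5, 10.75, 11.0]
```

At 4 FPS the model thus observes a 2.25-second context window and predicts hand positions one second into the future.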