Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
Authors: Boshen Xu, Ziheng Wang, Yang Du, Zhinan Song, Sipeng Zheng, Qin Jin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that EgoNCE++ significantly enhances EgoHOI understanding, leading to improved performance across various EgoVLMs in tasks such as multi-instance retrieval, action recognition, and temporal understanding. Our code is available at https://github.com/xuboshen/EgoNCEpp. |
| Researcher Affiliation | Academia | Renmin University of China; Beijing Academy of Artificial Intelligence |
| Pseudocode | No | The paper describes methods using equations and prose, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/xuboshen/EgoNCEpp. |
| Open Datasets | Yes | Recent works (Lin et al., 2022) have utilized the large-scale dataset Ego4D (Grauman et al., 2022) to pretrain egocentric video-language models (EgoVLMs), enhancing performance in tasks such as egocentric video-text retrieval (Lin et al., 2022; Sigurdsson et al., 2018b) and action recognition (Sigurdsson et al., 2018a). To delve deeper into this question, we introduce EgoHOIBench, a novel multi-choice testbed derived from Ego4D. Our pretraining video clips are sourced from EgoClip-3.8M (Lin et al., 2022), ensuring no overlap with the clips used in EgoHOIBench. Downstream Benchmark and Evaluation Setups. We evaluate our model on three types of tasks across seven benchmarks in a zero-shot setting: (1) Open-vocabulary recognition: tasks that test video-text matching for video-and-language models. We evaluate on EgoHOIBench, EK-100-OV (Chatterjee et al., 2024), and ActionBench (Wang et al., 2023b). (2) Multi-instance retrieval: conducted on Epic-Kitchens-100 (Damen et al., 2021). (3) Action recognition: tested on CharadesEgo (Sigurdsson et al., 2018a), EK-100-CLS (Damen et al., 2021), and EGTEA (Li et al., 2018). |
| Dataset Splits | Yes | We evaluate our model on three types of tasks across seven benchmarks in a zero-shot setting: (1) Open-vocabulary recognition: tasks that test video-text matching for video-and-language models. We evaluate on EgoHOIBench, EK-100-OV (Chatterjee et al., 2024), and ActionBench (Wang et al., 2023b). (2) Multi-instance retrieval: conducted on Epic-Kitchens-100 (Damen et al., 2021). (3) Action recognition: tested on CharadesEgo (Sigurdsson et al., 2018a), EK-100-CLS (Damen et al., 2021), and EGTEA (Li et al., 2018). For the zero-shot setting, we conduct video-text matching for retrieval tasks, using 16 frames for evaluation. For the fine-tune setting, we finetune the EgoVLMs using the AdamW optimizer. For the fine-tuning setup, we leverage the visual encoder and attach an additional linear projection head for the classification purpose, following Kazakos et al. (2021). The models are trained and evaluated on the first split of the validation set. |
| Hardware Specification | Yes | The models are continually pretrained for 10 epochs over a period of 12 hours using 8 A800 GPUs, with a total batch size of 576. |
| Software Dependencies | No | The paper mentions using specific models like LLaMA3-8B (AI, 2024), DistilBERT (Sanh et al., 2019), CLIP (Radford et al., 2021), RoBERTa (Liu et al., 2019), and libraries like spaCy (Honnibal et al., 2020) and Decord. However, specific version numbers for these software libraries are not provided. |
| Experiment Setup | Yes | We employ LoRA tuning with both rank and alpha set to 16. The models are continually pretrained for 10 epochs over a period of 12 hours using 8 A800 GPUs, with a total batch size of 576. We utilize LLaMA3-8B (AI, 2024) to generate negative captions for the videos. For all models, we adopt the AdamW optimizer with parameters β1 = 0.9 and β2 = 0.999. The learning rate follows a cosine annealing schedule, starting at 3e-5 and gradually reducing to 3e-7. During pretraining, we sample 4 frames from each video. During training, we apply standard RandomResizedCrop for data augmentation and employ LoRA tuning to continuously pretrain our EgoVLM. |
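The cosine annealing schedule quoted in the Experiment Setup row (learning rate decaying from 3e-5 to 3e-7) can be sketched as a small standalone function. This is a minimal illustration of the standard cosine annealing formula, not code from the authors' repository; the function name and step-based granularity are assumptions.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=3e-5, lr_min=3e-7):
    """Learning rate at `step`, decayed from lr_max to lr_min via cosine annealing."""
    progress = step / total_steps  # fraction of training completed, in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# The schedule starts at lr_max, passes through the midpoint halfway, and ends at lr_min.
start = cosine_annealing_lr(0, 1000)    # 3e-5
mid = cosine_annealing_lr(500, 1000)    # (3e-5 + 3e-7) / 2
end = cosine_annealing_lr(1000, 1000)   # 3e-7
```

In practice the paper's setup would use a framework scheduler (e.g. a cosine schedule in PyTorch) paired with AdamW (β1 = 0.9, β2 = 0.999); the closed form above shows what such a scheduler computes per step.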