Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks. |
| Researcher Affiliation | Academia | 1Zhejiang University, 2Shanghai Artificial Intelligence Laboratory, 3The University of Tokyo, 4Fudan University, 5Nanjing University, 6SIAT, 7Shanghai Jiao Tong University, EMAIL; EMAIL |
| Pseudocode | No | The paper describes the methodology using text, equations (e.g., Equation 1 and 2), and architectural diagrams (Figure 2, Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/OpenRobotLab/EgoHOD/. |
| Open Datasets | Yes | Pretraining Dataset. As stated in the previous section, our pretraining data comes from Ego4D (Grauman et al., 2022) and Howto-Interlink7M (Wang et al., 2024a). Downstream Tasks. We evaluate models on several egocentric downstream tasks: (1) EPIC-KITCHENS-100 (Damen et al., 2020) (EK-100) tasks... (2) Ego4D (Grauman et al., 2022) tasks... (3) EGTEA (Li et al., 2018) tasks... (4) Other tasks. We also evaluate our model on the GTEA (Fathi et al., 2011b) and HOI4D (Liu et al., 2022) datasets for the action segmentation task. Meanwhile, to show the generalization ability of our learned video representation, we evaluate the task success rate on the Franka Kitchen dataset (Gupta et al., 2019)... |
| Dataset Splits | Yes | We train a style classifier P by manually annotating 10,000 clips as ego-like or non-ego-like. With this classifier, we obtain an additional 3.4M egocentric-style clips. More details can be found in Appendix A. Data Selection. Since our HOD involves data not only from Ego4D but also from Howto-Interlink7M, we use a style classifier P to filter egocentric-style videos from the Howto-Interlink7M dataset. Specifically, our style classifier employs a simple two-layer MLP architecture. We utilize InternVideo2 (Wang et al., 2024b) to extract video features from all videos of the Howto-Interlink7M dataset. After that, we manually annotate 10,000 clips with positive and negative labels, where a positive label indicates an egocentric (or ego-like) video. Examples of positively and negatively labeled videos can be found in Figure 5. We randomly select 10% of these clips to form the validation set. Action segmentation tests the representation's understanding of the temporal dependencies of the video (Huang et al., 2020b; Yi et al., 2021). We evaluate our model on two benchmark datasets: GTEA (Fathi et al., 2011b) and HOI4D (Liu et al., 2022). We follow previous work in using four-fold cross-validation on both datasets. |
| Hardware Specification | Yes | For the computational cost, it takes around 2 days to extract bounding boxes from all vision-language clips and 3 days to generate narrations with the LLM using 32 A100 GPUs, for a total of around 4000 GPU hours. For EgoVideo-B, we adopt a batch size of 128 over 16 GPUs with a fixed learning rate of 5e-5. For EgoVideo-L, we use a batch size of 32 over 16 GPUs with a fixed learning rate of 3e-5. For EgoVideo-G, we use a batch size of 16 over 16 GPUs with a fixed learning rate of 1e-5. In all tasks we use 8 GPUs for finetuning. For the EgoNLQ task (Grauman et al., 2022), we build on the methodologies introduced by EgoVLP (Lin et al., 2022) and LaViLa (Zhao et al., 2023) for fairness. We adopt VSLNet (Zhang et al., 2020) as the task head. We train the task head for 50 epochs, using a learning rate of 3e-3, dropout 0.3, and batch size 32 on a single A100 GPU. |
| Software Dependencies | No | The paper mentions various models and tools such as the AdamW optimizer, a GPT-like Transformer, CLIP, ViT, InternVideo2, Yi-34B, GPT-4o, VSLNet, VSGN, MS-TCN, ASFormer, and DiffAct, but does not specify version numbers for any of them. |
| Experiment Setup | Yes | Pretraining Details. We pre-train on the video-narration pairs generated by our HOD from Ego4D and Howto-Interlink7M. We use the AdamW optimizer with betas = (0.9, 0.999) for 15 epochs, with different settings for different model sizes. For EgoVideo-B, we adopt a batch size of 128 over 16 GPUs with a fixed learning rate of 5e-5. For EgoVideo-L, we use a batch size of 32 over 16 GPUs with a fixed learning rate of 3e-5. For EgoVideo-G, we use a batch size of 16 over 16 GPUs with a fixed learning rate of 1e-5. For input frames, we preprocess the frames by resizing the shorter side to 320 pixels, which accelerates data loading. Subsequently, we apply a standard RandomResizedCrop (Zhao & Krähenbühl, 2023) with a scale parameter of (0.5, 1.0) to obtain the corresponding input frames. Finetuning Details. We finetune on the downstream tasks using AdamW with (β1, β2) = (0.9, 0.999) and a weight decay of 0.05 with cosine annealing. Table 12 shows the hyperparameter details, and in all tasks we use 8 GPUs for finetuning. For the EgoNLQ task... We train the task head for 50 epochs, using a learning rate of 3e-3, dropout 0.3, and batch size 32 on a single A100 GPU. For the EgoMQ task... We set the batch size to 16, the learning rate to 2e-4, and gamma to 0.6, and train the task head on a single A100 GPU. |
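The Dataset Splits row quotes the paper's data-selection step: a simple two-layer MLP over InternVideo2 clip features, trained on 10,000 manually annotated clips, outputting an ego-like probability. A minimal sketch of such a classifier's forward pass is below; the hidden width, toy dimensions, and the ReLU/sigmoid choices are illustrative assumptions, not details from the paper.

```python
import math

def two_layer_mlp(x, W1, b1, W2, b2):
    """Forward pass of a two-layer MLP binary classifier.

    x  : input feature vector (e.g. an InternVideo2 clip embedding)
    W1 : hidden-layer weight rows, b1: hidden biases
    W2 : output weights over hidden units, b2: output bias
    Returns a sigmoid probability that the clip is ego-like.
    """
    # Hidden layer with ReLU activation (assumed; the paper does not specify)
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Single logit, squashed to a probability
    logit = sum(wi * hi for wi, hi in zip(W2, h)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Toy example: 2-D input, 2 hidden units, hand-picked weights
p = two_layer_mlp([1.0, 0.0],
                  W1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
                  W2=[1.0, 1.0], b2=0.0)
# p == sigmoid(1.0) ≈ 0.731; clips above a chosen threshold would be kept
```

In practice such a head would be trained with binary cross-entropy on the 10,000 labeled clips (with 10% held out for validation, per the quote) and then run over all Howto-Interlink7M features to select the 3.4M egocentric-style clips.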
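The Experiment Setup row states that finetuning uses AdamW with cosine annealing. A minimal sketch of a standard half-cosine decay schedule follows; the paper does not spell out warmup or a minimum learning rate, so both are omitted here (min_lr defaults to 0), and the function name is our own.

```python
import math

def cosine_annealed_lr(step, total_steps, base_lr, min_lr=0.0):
    """Half-cosine decay: base_lr at step 0, min_lr at total_steps."""
    progress = step / total_steps          # fraction of training completed
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# With the EgoNLQ head's quoted base LR of 3e-3 over a 50-epoch run:
lrs = [cosine_annealed_lr(e, 50, 3e-3) for e in range(51)]
# lrs[0] == 3e-3, lrs[25] == 1.5e-3, lrs[50] == 0.0
```

This matches the shape of common cosine-annealing implementations (e.g. PyTorch's CosineAnnealingLR); the actual schedule hyperparameters used per task are in the paper's Table 12.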