EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
Authors: Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos. We conduct extensive experiments on the cross-view video benchmark datasets, i.e., Ego-Exo4D (Grauman et al., 2024) and H2O (Kwon et al., 2021), which include rich and diverse hand-object interactions and shooting environments. Experimental results show that EgoExo-Gen significantly outperforms prior video prediction models (Chen et al., 2023; Ren et al., 2024; Gu et al., 2023) and improves the quality of predicted videos by leveraging hand and object dynamics. Also, EgoExo-Gen demonstrates strong zero-shot transfer ability on unseen actions and environments. We compare our method with prior video prediction models... As shown in Table 1, the fine-tuned ConsistI2V (Ren et al., 2024) and SEINE (Chen et al., 2023) models achieve comparably higher accuracy over the video prediction models... EgoExo-Gen consistently outperforms prior methods on all metrics, highlighting the benefits of explicitly modeling the hand-object dynamics in video prediction models. We evaluate our model's generalisation ability on unseen data distribution, i.e., H2O (Kwon et al., 2021)... |
| Researcher Affiliation | Academia | 1School of Computer Science, Shanghai Key Lab of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University 2The University of Tokyo, 3Zhejiang University, 4Hong Kong University of Science and Technology, 5Nanjing University, 6Shanghai Jiao Tong University, 7Shanghai Artificial Intelligence Laboratory |
| Pseudocode | No | The paper describes its methodology in detail in Section 2, accompanied by block diagrams (Figure 2). However, it does not include any explicitly labeled pseudocode or algorithm blocks. The procedural steps are described in paragraph text. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is released, nor does it provide any links to a code repository. Phrases like 'We release our code...' or direct GitHub links are absent. |
| Open Datasets | Yes | We conduct extensive experiments on the cross-view video benchmark datasets, i.e., Ego-Exo4D (Grauman et al., 2024) and H2O (Kwon et al., 2021), which include rich and diverse hand-object interactions and shooting environments. |
| Dataset Splits | Yes | The training set contains 33,448 video clips with an average duration of 1 second. Each video clip is paired with a narration (e.g., C drops the knife on the chopping board with his right hand.) with start and end timestamps. We sample 1,000 video clips from the validation set, from which we select 500 video clips and annotate them with HOI masks to evaluate the performance of the mask prediction model. The training and validation sets have distinct takes, posing challenges to the model's generalisation ability on unseen subjects and locations. To evaluate the model's zero-shot transfer ability, we also adopt H2O (Kwon et al., 2021), an ego-exo HOI dataset focusing on tabletop activities (e.g., squeeze lotion, grab spray). The validation set of H2O is composed of 122 clips with action labels. |
| Hardware Specification | No | The paper mentions training details such as optimizer, learning rate, batch size, epochs, and spatial resolution for both the mask prediction and video diffusion models. However, it does not specify any particular hardware like GPU models (e.g., NVIDIA A100), CPU models, or cloud computing resources used for these experiments. |
| Software Dependencies | No | The paper mentions several models and algorithms used (e.g., ResNet, CLIP text encoder, DDIM sampler, SEINE, SAM-2, EgoHOS, 100DOH, Sapiens, RAFT), and the Adam optimizer. However, it does not provide specific version numbers for any programming languages (like Python), libraries (like PyTorch or TensorFlow), or other key software components, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | We train our cross-view mask prediction model for 30 epochs with a batch size of 32 using the Adam optimizer. The initial learning rate is set to 10⁻⁵. We sample 16 frames with a fixed spatial resolution of 480×480 for both ego-centric and exo-centric videos. ... For the video diffusion model, we train both stages for 10 epochs with a batch size of 32 and a fixed learning rate of 10⁻⁴. We initialise our model with SEINE (Chen et al., 2023) pre-trained on web-scale video-text pairs, and train the model with 16 sampled frames with resolution 256×256. During inference, we adopt the DDIM sampler (Song et al., 2020) with 100 steps in our experiments. |
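The quoted experiment setup reduces to a small set of hyperparameters. As a convenience for anyone attempting reproduction, the sketch below collects them into plain-Python config objects; the class and field names are our own shorthand, not identifiers from the paper or its (unreleased) code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MaskPredictorConfig:
    # Cross-view mask prediction model, as reported in the quoted setup.
    epochs: int = 30
    batch_size: int = 32
    optimizer: str = "adam"
    lr: float = 1e-5                      # initial learning rate
    num_frames: int = 16                  # sampled frames per clip
    resolution: tuple = (480, 480)        # ego- and exo-centric videos

@dataclass(frozen=True)
class VideoDiffusionConfig:
    # Video diffusion model, both training stages, as reported.
    epochs: int = 10
    batch_size: int = 32
    lr: float = 1e-4                      # fixed learning rate
    init_checkpoint: str = "SEINE"        # pre-trained on web-scale video-text pairs
    num_frames: int = 16
    resolution: tuple = (256, 256)
    sampler: str = "ddim"
    sampling_steps: int = 100             # DDIM steps at inference

mask_cfg = MaskPredictorConfig()
diffusion_cfg = VideoDiffusionConfig()
print(mask_cfg.lr, diffusion_cfg.sampling_steps)  # 1e-05 100
```

Note that hardware and software versions are not recoverable from the paper (see the rows above), so these configs capture everything the paper reports about the training recipe.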