EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Authors: Wei Yu, Songheng Yin, Steve Easterbrook, Animesh Garg

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments, we demonstrate that our new model EgoSim achieves excellent results on both the Real Estate and newly repurposed Epic-Field datasets. We evaluate the proposed method EgoSim on two competitive benchmarks, the Real Estate and Epic-Field datasets, in a multi-condition-input setting. Extensive experimental results show that our model achieves precise control that previous methods could not accomplish. This paper introduces EgoSim, a compositional world simulator that can egocentrically explore and interact with the observed environment. To achieve this, we identify two major obstacles that prevent the creation of meaningful videos that accurately adhere to user-specified instructions, and tackle them with spacetime epipolar attention and a CI2V-adapter. Our model demonstrates unprecedented controllability, and the potential uses of such an egocentric world simulator are diverse and impactful.
Researcher Affiliation Collaboration Wei Yu1,2, Songheng Yin3, Steve Easterbrook1, Animesh Garg2,4,5 — 1University of Toronto, 2Vector Institute, 3Columbia University, 4NVIDIA, 5Georgia Tech
Pseudocode No The paper includes mathematical equations for epipolar line calculation and epipolar attention (Equations 1, 3, 4), but no structured pseudocode or algorithm blocks are provided.
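Since the paper provides only equations, the following is a minimal sketch of how an epipolar attention mask could be computed from a relative camera pose. All names are hypothetical and this is not the authors' implementation; it only illustrates the standard epipolar-geometry machinery the equations rely on:

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_matrix(K, R, t):
    """F mapping pixels in the source view to epipolar lines in the
    target view, given intrinsics K and relative pose (R, t)."""
    E = skew(t) @ R                      # essential matrix
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv           # F = K^-T E K^-1

def epipolar_attention_mask(K, R, t, h, w, thresh=2.0):
    """Boolean mask of shape (h*w, h*w): entry (i, j) is True when
    target pixel j lies within `thresh` pixels of the epipolar line
    induced by source pixel i (a common way to restrict attention)."""
    F = fundamental_matrix(K, R, t)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N homogeneous
    lines = F @ pts                      # epipolar lines a*x + b*y + c = 0
    norm = np.linalg.norm(lines[:2], axis=0, keepdims=True) + 1e-8
    lines = lines / norm                 # normalize so |ax + by + c| is a pixel distance
    dist = np.abs(lines.T @ pts)         # N_src x N_tgt point-to-line distances
    return dist < thresh
```

In a diffusion backbone, such a mask would typically be applied inside the attention softmax (setting masked logits to -inf) so that each token only attends along its epipolar line in the other frame.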
Open Source Code No The paper states: "For more results, please refer to https://egosim.github.io/EgoSim/." This link is described as being for "more results" and is a project page, not an explicit statement of code release or a direct link to a code repository. The text does not provide a concrete access point for the source code of the methodology described in the paper.
Open Datasets Yes Through extensive experiments, we demonstrate that our new model EgoSim achieves excellent results on both the Real Estate and newly repurposed Epic-Field datasets. For more results, please refer to https://egosim.github.io/EgoSim/. We repurpose the Epic-Field dataset for egocentric video generation and establish a new benchmark to evaluate video diffusion models in more interactive and dynamic settings. We evaluate the proposed method EgoSim on two competitive benchmarks, the Real Estate and Epic-Field datasets, in a multi-condition-input setting. Real Estate Zhou et al. (2018) consists of a large number of open house video tours. Fortunately, we have identified a highly suitable dataset, Epic-Field Damen et al. (2018), which, although not previously used in video generation research, aligns perfectly with our requirements.
Dataset Splits Yes For Real Estate dataset, we use 67,477 scenes for training and 7,289 scenes for testing. For Epic-Field, we use 611 scenes for training and 88 scenes for testing. We combined the hard, medium, and simple groups in a 1:1:1 ratio to adequately test the camera poses.
Hardware Specification Yes All models are trained on 8 NVIDIA A100 GPUs for 300k iterations using an effective batch size of 32.
Software Dependencies No The paper mentions using "AdamW optimizer" and "DDIM scheduler" and "BF16 precision" but does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes All models are trained on 8 NVIDIA A100 GPUs for 300k iterations using an effective batch size of 32. We use BF16 precision for training SVD. We use the DDIM scheduler with 1000 steps during training and 25 steps during inference. During training, we use a sample stride of 6 for Real Estate and 4 for Epic-Field. We also use several additional techniques for data augmentation. In Real Estate, since most scenes are static, we can reverse the videos to generate additional motion trajectories. In both datasets, we can randomly increase or decrease the sample stride by one step to obtain video clips with different speeds. As a result, each training sample consists of a 14-frame video clip, a text prompt, and camera poses for all frames.
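The clip-sampling and augmentation scheme described above (14-frame clips, base stride of 6 or 4, ±1 stride jitter, optional reversal for static scenes) can be sketched as follows. This is an illustrative reconstruction with hypothetical names, not the authors' training code:

```python
import random

def sample_clip(num_video_frames, clip_len=14, base_stride=6,
                jitter=True, allow_reverse=True):
    """Sample frame indices for one training clip.

    - stride jitter: randomly add -1/0/+1 to the base stride, yielding
      clips at slightly different playback speeds
    - reversal: play a (static-scene) clip backwards to obtain an
      additional camera trajectory from the same footage
    """
    stride = base_stride + (random.choice([-1, 0, 1]) if jitter else 0)
    stride = max(1, stride)
    span = (clip_len - 1) * stride
    start = random.randint(0, num_video_frames - span - 1)
    idx = [start + i * stride for i in range(clip_len)]
    if allow_reverse and random.random() < 0.5:
        idx = idx[::-1]
    return idx
```

For Epic-Field, the same sampler would be called with `base_stride=4`; reversal would be disabled there, since egocentric kitchen scenes are dynamic and reversed motion is not physically plausible.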