EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Authors: Wei Yu, Songheng Yin, Steve Easterbrook, Animesh Garg

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments, we demonstrate that our new model EgoSim achieves excellent results on both the Real Estate and newly repurposed Epic-Field datasets. We evaluate the proposed method EgoSim on two competitive benchmarks, the Real Estate and Epic-Field datasets, in a multi-condition-input setting. Extensive experimental results show that our model achieves precise control that previous methods could not accomplish. This paper introduces EgoSim, a compositional world simulator that can egocentrically explore and interact with the observed environment. To achieve this, we identify two major obstacles that prevent the creation of meaningful videos that accurately adhere to user-specified instructions, and tackle them with spacetime epipolar attention and a CI2V-adapter. Our model demonstrates unprecedented controllability, and the potential uses of such an egocentric world simulator are diverse and impactful.
Researcher Affiliation Collaboration Wei Yu1,2, Songheng Yin3, Steve Easterbrook1, Animesh Garg2,4,5 — 1University of Toronto, 2Vector Institute, 3Columbia University, 4NVIDIA, 5Georgia Tech
Pseudocode No The paper includes mathematical equations for epipolar line calculation and epipolar attention (Equations 1, 3, 4), but no structured pseudocode or algorithm blocks are provided.
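Since the paper provides only equations, the following is a minimal sketch of how an epipolar attention mask could be computed from a relative camera pose. All names are hypothetical and this is not the authors' implementation; it only illustrates the standard epipolar-geometry machinery the equations rely on:

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_matrix(K, R, t):
    """F mapping pixels in the source view to epipolar lines in the
    target view, given intrinsics K and relative pose (R, t)."""
    E = skew(t) @ R                      # essential matrix
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv           # F = K^-T E K^-1

def epipolar_attention_mask(K, R, t, h, w, thresh=2.0):
    """Boolean mask of shape (h*w, h*w): entry (i, j) is True when
    target pixel j lies within `thresh` pixels of the epipolar line
    induced by source pixel i (a common way to restrict attention)."""
    F = fundamental_matrix(K, R, t)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N homogeneous
    lines = F @ pts                      # epipolar lines a*x + b*y + c = 0
    norm = np.linalg.norm(lines[:2], axis=0, keepdims=True) + 1e-8
    lines = lines / norm                 # normalize so |ax + by + c| is a pixel distance
    dist = np.abs(lines.T @ pts)         # N_src x N_tgt point-to-line distances
    return dist < thresh
```

In a diffusion backbone, such a mask would typically be applied inside the attention softmax (setting masked logits to -inf) so that each token only attends along its epipolar line in the other frame.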
Open Source Code No The paper states: "For more results, please refer to https://egosim.github.io/EgoSim/." This link is described as being for "more results" and is a project page, not an explicit statement of code release or a direct link to a code repository. The text does not provide a concrete access point for the source code of the methodology described in the paper.
Open Datasets Yes Through extensive experiments, we demonstrate that our new model EgoSim achieves excellent results on both the Real Estate and newly repurposed Epic-Field datasets. For more results, please refer to https://egosim.github.io/EgoSim/. We repurpose the Epic-Field dataset for egocentric video generation and establish a new benchmark to evaluate video diffusion models in more interactive and dynamic settings. We evaluate the proposed method EgoSim on two competitive benchmarks, the Real Estate and Epic-Field datasets, in a multi-condition-input setting. Real Estate Zhou et al. (2018) consists of a large number of open house video tours. Fortunately, we have identified a highly suitable dataset, Epic-Field Damen et al. (2018), which, although not previously used in video generation research, aligns perfectly with our requirements.
Dataset Splits Yes For Real Estate dataset, we use 67,477 scenes for training and 7,289 scenes for testing. For Epic-Field, we use 611 scenes for training and 88 scenes for testing. We combined the hard, medium, and simple groups in a 1:1:1 ratio to adequately test the camera poses.
Hardware Specification Yes All models are trained on 8 NVIDIA A100 GPUs for 300k iterations using an effective batch size of 32.
Software Dependencies No The paper mentions using "AdamW optimizer" and "DDIM scheduler" and "BF16 precision" but does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes All models are trained on 8 NVIDIA A100 GPUs for 300k iterations using an effective batch size of 32. We use BF16 precision for training SVD. We use the DDIM scheduler with 1000 steps during training and 25 steps during inference. During training, we use a sample stride of 6 for Real Estate and 4 for Epic-Field. We also use several additional techniques for data augmentation. In Real Estate, since most scenes are static, we can reverse the videos to generate additional motion trajectories. In both datasets, we can randomly increase or decrease the sample stride by one step to obtain video clips with different speeds. As a result, each training sample consists of a 14-frame video clip, a text prompt, and camera poses for all frames.
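The clip-sampling and augmentation scheme described above (14-frame clips, base stride of 6 or 4, ±1 stride jitter, optional reversal for static scenes) can be sketched as follows. This is an illustrative reconstruction with hypothetical names, not the authors' training code:

```python
import random

def sample_clip(num_video_frames, clip_len=14, base_stride=6,
                jitter=True, allow_reverse=True):
    """Sample frame indices for one training clip.

    - stride jitter: randomly add -1/0/+1 to the base stride, yielding
      clips at slightly different playback speeds
    - reversal: play a (static-scene) clip backwards to obtain an
      additional camera trajectory from the same footage
    """
    stride = base_stride + (random.choice([-1, 0, 1]) if jitter else 0)
    stride = max(1, stride)
    span = (clip_len - 1) * stride
    start = random.randint(0, num_video_frames - span - 1)
    idx = [start + i * stride for i in range(clip_len)]
    if allow_reverse and random.random() < 0.5:
        idx = idx[::-1]
    return idx
```

For Epic-Field, the same sampler would be called with `base_stride=4`; reversal would be disabled there, since egocentric kitchen scenes are dynamic and reversed motion is not physically plausible.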