VideoPhy: Evaluating Physical Commonsense for Video Generation

Authors: Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., CogVideoX) and closed models (e.g., Lumiere, Dream Machine).
Researcher Affiliation | Collaboration | 1University of California Los Angeles; 2Google Research
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the main text. The paper describes a three-stage pipeline for dataset construction, but does so in paragraph form rather than as structured pseudocode.
Open Source Code | Yes | Code: https://github.com/Hritikbansal/videophy.
Open Datasets | Yes | To this end, we propose VIDEOPHY, a dataset designed to evaluate the adherence of generated videos to physical commonsense in real-world scenarios.
Dataset Splits | Yes | To facilitate this, we split the prompts in the VIDEOPHY dataset equally into train and test sets. Specifically, we utilize the human annotations on the generated videos for the 344 prompts in the test set for benchmarking, while the human annotations on the generated videos for the 344 prompts in the train set are used for training the automatic evaluation model.
Hardware Specification | Yes | We utilized 2 A6000 GPUs with a total batch size of 32.
Software Dependencies | No | The paper mentions software such as VIDEOCON, optimizers such as Adam, and noise schedulers such as DDPM, DPMSolver, DDIM, and Euler Discrete. It also specifies the use of Low-Rank Adaptation (LoRA). However, it does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | To create VIDEOCON-PHYSICS, we use low-rank adaptation (LoRA) [39] of VIDEOCON, applied to all layers of the attention blocks, including the QKVO, gate, up, and down projection matrices. We set the LoRA r = 32, α = 32, and dropout = 0.05. The finetuning is performed for 5 epochs using the Adam [46] optimizer with a linear warmup of 50 steps followed by linear decay. Similar to [5], we chose the peak learning rate as 1e-4. We utilized 2 A6000 GPUs with a total batch size of 32.
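The finetuning recipe in the Experiment Setup row (LoRA with r = 32, α = 32, and a linear warmup of 50 steps followed by linear decay from a peak learning rate of 1e-4) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the zero-initialization of B and the `total_steps` value are standard LoRA/scheduler conventions assumed here, since the paper specifies 5 epochs rather than an absolute step count.

```python
import numpy as np

def lora_forward(x, W, A, B, r=32, alpha=32):
    """Output of a frozen linear layer W plus its LoRA update.

    A (shape r x d_in) is randomly initialized; B (shape d_out x r)
    is conventionally zero-initialized, so training starts from the
    base model's behavior. The update is scaled by alpha / r
    (here 32 / 32 = 1).
    """
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def lr_at_step(step, peak_lr=1e-4, warmup_steps=50, total_steps=1000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0.

    total_steps is a placeholder assumption; the paper gives the
    duration as 5 epochs, not a step count.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

With B zero-initialized, `lora_forward` initially reproduces the frozen layer exactly; only A and B receive gradient updates during finetuning, which is what makes LoRA cheap enough to run on the 2 A6000 GPUs reported above.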