VideoPhy: Evaluating Physical Commonsense for Video Generation

Authors: Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., CogVideoX) and closed models (e.g., Lumiere, Dream Machine).
Researcher Affiliation | Collaboration | 1University of California Los Angeles; 2Google Research
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the main text. The paper describes a three-stage pipeline for dataset construction, but does so in paragraph form rather than as structured pseudocode.
Open Source Code | Yes | Code: https://github.com/Hritikbansal/videophy.
Open Datasets | Yes | To this end, we propose VIDEOPHY, a dataset designed to evaluate the adherence of generated videos to physical commonsense in real-world scenarios.
Dataset Splits | Yes | To facilitate this, we split the prompts in the VIDEOPHY dataset equally into train and test sets. Specifically, we utilize the human annotations on the generated videos for the 344 prompts in the test set for benchmarking, while the human annotations on the generated videos for the 344 prompts in the train set are used for training the automatic evaluation model.
Hardware Specification | Yes | We utilized 2 A6000 GPUs with a total batch size of 32.
Software Dependencies | No | The paper mentions software such as VIDEOCON, optimizers such as Adam, and noise schedulers such as DDPM, DPMSolver, DDIM, and Euler Discrete. It also specifies the use of Low-Rank Adaptation (LoRA). However, it does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | To create VIDEOCON-PHYSICS, we use low-rank adaptation (LoRA) [39] of VIDEOCON, applied to all layers of the attention blocks, including the QKVO, gate, up, and down projection matrices. We set the LoRA r = 32, α = 32, and dropout = 0.05. The finetuning is performed for 5 epochs using the Adam [46] optimizer with a linear warmup of 50 steps followed by linear decay. Similar to [5], we chose the peak learning rate as 1e-4. We utilized 2 A6000 GPUs with a total batch size of 32.
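The finetuning recipe in the Experiment Setup row (LoRA with r = 32, α = 32, and a linear warmup of 50 steps followed by linear decay from a peak learning rate of 1e-4) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the zero-initialization of B and the `total_steps` value are standard LoRA/scheduler conventions assumed here, since the paper specifies 5 epochs rather than an absolute step count.

```python
import numpy as np

def lora_forward(x, W, A, B, r=32, alpha=32):
    """Output of a frozen linear layer W plus its LoRA update.

    A (shape r x d_in) is randomly initialized; B (shape d_out x r)
    is conventionally zero-initialized, so training starts from the
    base model's behavior. The update is scaled by alpha / r
    (here 32 / 32 = 1).
    """
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def lr_at_step(step, peak_lr=1e-4, warmup_steps=50, total_steps=1000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0.

    total_steps is a placeholder assumption; the paper gives the
    duration as 5 epochs, not a step count.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

With B zero-initialized, `lora_forward` initially reproduces the frozen layer exactly; only A and B receive gradient updates during finetuning, which is what makes LoRA cheap enough to run on the 2 A6000 GPUs reported above.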