VideoPhy: Evaluating Physical Commonsense for Video Generation
Authors: Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., CogVideoX) and closed models (e.g., Lumiere, Dream Machine). |
| Researcher Affiliation | Collaboration | 1) University of California, Los Angeles; 2) Google Research |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the main text. The paper describes a three-stage pipeline for dataset construction but does so in paragraph form, not as structured pseudocode. |
| Open Source Code | Yes | Code: https://github.com/Hritikbansal/videophy. |
| Open Datasets | Yes | To this end, we propose VIDEOPHY, a dataset designed to evaluate the adherence of generated videos to physical commonsense in real-world scenarios. |
| Dataset Splits | Yes | To facilitate this, we split the prompts in the VIDEOPHY dataset equally into train and test sets. Specifically, we utilize the human annotations on the generated videos for the 344 prompts in the test set for benchmarking, while the human annotations on the generated videos for the 344 prompts in the train set are used for training the automatic evaluation model. |
| Hardware Specification | Yes | We utilized 2 A6000 GPUs with a total batch size of 32. |
| Software Dependencies | No | The paper mentions software like VIDEOCON, optimizers like Adam, and noise schedulers like DDPM, DPMSolver, DDIM, and Euler Discrete. It also specifies using Low-Rank Adaptation (LoRA). However, it does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | To create VIDEOCON-PHYSICS, we use low-rank adaptation (LoRA) [39] of VIDEOCON, applied to all the layers of the attention blocks including the QKVO, gate, up, and down projection matrices. We set the LoRA r = 32, α = 32, and dropout = 0.05. The finetuning is performed for 5 epochs using the Adam [46] optimizer with a linear warmup of 50 steps followed by linear decay. Similar to [5], we chose the peak learning rate as 1e-4. We utilized 2 A6000 GPUs with a total batch size of 32. |
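The fine-tuning recipe in the last row (LoRA with r = 32, α = 32, dropout 0.05, Adam with a 50-step linear warmup followed by linear decay to a peak learning rate of 1e-4) can be sketched in plain Python. This is a hedged sketch only: the `target_modules` names are Hugging Face-style assumptions for the "QKVO, gate, up and down" projections, and `total_steps` is an illustrative placeholder, since the paper reports epochs rather than step counts.

```python
# Sketch of the reported VIDEOCON-PHYSICS fine-tuning hyperparameters.
# Module names and total_steps are assumptions, not taken from the paper.
lora_config = {
    "r": 32,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # "all the layers of the attention blocks including QKVO, gate,
    # up and down projection matrices" -- names assumed, HF-style
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

def lr_at_step(step, peak_lr=1e-4, warmup_steps=50, total_steps=1000):
    """Linear warmup to peak_lr, then linear decay toward zero.

    Schedule shape follows the paper's description; total_steps is an
    assumed placeholder, not a value the paper reports.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

In a typical setup, `lr_at_step` would be queried once per optimizer step; the warmup region ramps from 0 to the peak over the first 50 steps, after which the rate decays linearly for the remainder of training.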