VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Authors: Yumeng Li, William H. Beluch, Margret Keuper, Dan Zhang, Anna Khoreva
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally showcase the superiority of our method in synthesizing longer, visually appealing videos over open-sourced T2V models. |
| Researcher Affiliation | Collaboration | 1: Amazon; 2: Bosch Center for Artificial Intelligence; 3: University of Mannheim; 4: Max Planck Institute for Informatics; 5: Zalando |
| Pseudocode | No | The paper describes methods and equations for Temporal Attention Regularization and Video Synopsis Prompting but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We plan to release the code upon acceptance. |
| Open Datasets | Yes | Experimental setting. To demonstrate the effectiveness of VSTAR in creating more dynamic videos, we run experiments and ablations on ChronoMagic-Bench-150 (Yuan et al., 2024) and prompts generated by ChatGPT (OpenAI, 2022) describing various visual transitions. [...] For our analysis, we use VideoCrafter2 (Chen et al., 2024a) along with videos from the DAVIS dataset (Perazzi et al., 2016) and additional videos collected from the web. |
| Dataset Splits | No | The paper mentions using ChronoMagic-Bench-150 and the DAVIS dataset but does not specify any training, validation, or test splits used for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not specify version numbers for any software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | By default, we employ the state-of-the-art open-sourced T2V model VideoCrafter2 (Chen et al., 2024a) with 320 × 512 resolution as our base model, which is combined with the proposed video synopsis prompting (VSP) and temporal attention regularization (TAR). |