VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Authors: Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Wangmeng Zuo

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We have conducted experiments on extensive prompts under combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. [...] 5 Experiments 5.1 Experimental settings 5.2 Comparisons with T2V baselines 5.3 Ablation studies
Researcher Affiliation | Collaboration | 1 Harbin Institute of Technology, 2 Tongyi Lab
Pseudocode | No | The paper describes methods using text and mathematical equations (e.g., Eqn. 1 to Eqn. 11), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/YBYBZhang/VideoElevator ; Project: https://videoelevator.github.io/
Open Datasets | Yes | We evaluate VideoElevator and other baselines on two benchmarks: (i) the VBench (Huang et al. 2023b) dataset, which covers a variety of content categories and contains 800 prompts; (ii) the Video Creation dataset, which unifies the creative prompt sets of Make-A-Video (Singer et al. 2023) and Video LDM (Blattmann et al. 2023b) and consists of 100 prompts in total. (A prompt-loading sketch appears after this table.)
Dataset Splits | No | The paper mentions using 'VBench' and the 'Video Creation dataset' and specifies their total numbers of prompts (800 and 100, respectively), but does not provide training, validation, or test splits or their proportions for these datasets.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions various models and frameworks such as 'Stable Diffusion V1.5 or V2.1-base', 'AnimateDiff', 'ZeroScope', 'LaVie', 'T2I', 'T2V', 'LDM', and 'U-Net', and evaluation metrics such as 'CLIP score', 'CLIP-IQA', and the 'LAION aesthetic predictor'. However, it does not provide specific version numbers for any programming languages, libraries, or other software dependencies used for implementation. (A metric-computation sketch appears after this table.)
Experiment Setup | Yes | Notably, when N is very small (e.g., N = 1), the synthesized video only contains coarse-grained motion, so we set N to 8-10 to add fine-grained motion (refer to Appendix B). [...] Empirically, applying temporal motion refining in just a few timesteps (i.e., 4-5 steps) can ensure temporal consistency (refer to Appendix B). (A sampling-schedule sketch appears after this table.)
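
On the datasets row: both benchmarks are plain prompt lists, so the evaluation inputs are easy to assemble. A minimal sketch, assuming each benchmark ships as a text file with one prompt per line; the file names below are placeholders, not the benchmarks' actual distribution format:

```python
# Hypothetical prompt loader; file names and one-prompt-per-line
# layout are assumptions, not the benchmarks' documented format.
from pathlib import Path

def load_prompts(path: str) -> list[str]:
    """Read one prompt per line, skipping blank lines."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]

vbench = load_prompts("vbench_prompts.txt")            # expected: 800 prompts
creation = load_prompts("video_creation_prompts.txt")  # expected: 100 prompts
print(len(vbench), len(creation))
```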
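
On the software-dependencies row: the cited CLIP score can be reproduced with off-the-shelf tooling. A minimal sketch using torchmetrics; averaging frame-text similarity over a video's frames and the choice of CLIP checkpoint are our assumptions, since the paper pins neither:

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

# Checkpoint is our choice; the paper does not specify a CLIP variant.
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# frames: (T, 3, H, W) uint8 video frames; prompt: the text condition.
frames = torch.randint(0, 255, (16, 3, 224, 224), dtype=torch.uint8)
prompt = "a corgi surfing a wave at sunset"

# torchmetrics averages the per-frame image-text similarities (scaled by 100).
score = metric(frames, [prompt] * frames.shape[0])
print(f"CLIP score: {score.item():.2f}")
```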
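
On the experiment-setup row: the two hyperparameters (N T2V steps per refinement, and 4-5 refinement timesteps) suggest a sampling loop like the sketch below. This is only our reading of the setup; `t2v_refine` and `t2i_denoise_step` are hypothetical stubs, not the authors' released API:

```python
import torch

# Hypothetical stubs; the real method would invoke a T2V and a T2I
# diffusion model at these points.
def t2v_refine(latents, t, num_steps):
    return latents  # placeholder: temporal motion refining over N T2V steps

def t2i_denoise_step(latents, t):
    return latents  # placeholder: one T2I denoising step (quality elevating)

def sample(latents, timesteps, refine_at, n=8):
    for t in timesteps:
        if t in refine_at:                                 # only 4-5 selected timesteps
            latents = t2v_refine(latents, t, num_steps=n)  # N set to 8-10 in the paper
        latents = t2i_denoise_step(latents, t)
    return latents

latents = torch.randn(1, 4, 16, 64, 64)   # (B, C, T, H, W) video latents (assumed shape)
timesteps = list(range(50, 0, -1))        # a 50-step schedule (assumed)
refine_at = set(timesteps[::12][:4])      # 4 refinement timesteps, evenly spaced
out = sample(latents, timesteps, refine_at, n=8)
```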