T2V-Turbo-v2: Enhancing Video Model Post-Training through Data, Reward, and Conditional Guidance Design

Authors: Jiachen Li, Qian Long, Jian (Skyler) Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, William Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.
Researcher Affiliation | Collaboration | Jiachen Li (UC Santa Barbara), Qian Long (UC Los Angeles), Jian Zheng (Amazon AGI), Xiaofeng Gao (Amazon AGI), Robinson Piramuthu (Amazon AGI), Wenhu Chen (University of Waterloo), William Yang Wang (UC Santa Barbara)
Pseudocode | Yes | A. PSEUDO-CODES OF OUR T2V-TURBO-V2'S DATA PREPROCESSING AND TRAINING PIPELINE: Algorithm 1 and Algorithm 2 present the pseudo-codes for data preprocessing and training, respectively.
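The paper's Algorithms 1 and 2 are not reproduced in this report. As a rough illustration of the training step they describe, the following toy sketch combines a consistency-distillation (CD) loss with a reward objective, using the batch sizes from the Experiment Setup row (3 clips for the CD loss, 1 clip for the reward objective, M = 2 sampled frames for the image-text reward). All function bodies and names here are hypothetical stand-ins, not the paper's implementation.

```python
import random

# Hypothetical stand-ins for the paper's components; the real losses operate
# on latent video tensors, not scalars.
def consistency_distillation_loss(videos):
    # Toy CD loss: mean squared magnitude as a placeholder distance
    # between student and teacher predictions.
    return sum(v ** 2 for v in videos) / len(videos)

def image_text_reward(frame):
    # Toy image-text reward R_img evaluated on a single frame.
    return -abs(frame - 0.5)

def training_step(cd_batch, reward_video, m_frames=2):
    """One step: CD loss on a batch of 3 clips, reward objective on 1 clip.

    Per the paper's settings, M = 2 frames are sampled per video when
    optimizing against the image-text reward model.
    """
    loss_cd = consistency_distillation_loss(cd_batch)
    sampled = random.sample(reward_video, m_frames)
    # Maximizing reward = minimizing its negation.
    loss_reward = -sum(image_text_reward(f) for f in sampled) / m_frames
    return loss_cd + loss_reward

step_loss = training_step(cd_batch=[0.1, 0.2, 0.3],
                          reward_video=[0.4, 0.5, 0.6, 0.7])
```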
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: Our experiments are conducted with all open-sourced codes and training data. Our implementation codes have been included in the supplementary material and will be released to the public in a GitHub repository without breaking the double-blind rules.
Open Datasets | Yes | We experiment with VidGen-1M (Tan et al., 2024) (VG), OpenVid-1M (Nan et al., 2024) (OV), WebVid-10M (Bain et al., 2021) (WV), and their combinations.
Dataset Splits | No | We train on a mixed dataset VG + WV, which consists of equal portions of VidGen-1M (Tan et al., 2024) and WebVid-10M (Bain et al., 2021). While the CD loss is optimized across the entire dataset, the reward objective Eq. 10 is optimized using only WebVid data. To evaluate the 16-step generation of our method and T2V-Turbo, we carefully follow VBench's evaluation protocols by generating 5 videos for each prompt. The paper describes how data is used for different objectives but does not specify explicit train/test/validation splits for the datasets.
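The split described above routes objectives by data source rather than by a train/test partition: the CD loss sees the full VG + WV mixture, while the reward objective sees only WebVid samples. A minimal sketch of that routing, with the `source` tags and dict layout purely illustrative:

```python
def route_objectives(batch):
    """Split a mixed VG + WV batch by objective: the CD loss consumes every
    sample, while the reward objective consumes only WebVid samples.
    The "source" field and its values are illustrative assumptions."""
    cd_samples = list(batch)  # full mixed dataset
    reward_samples = [ex for ex in batch if ex["source"] == "webvid"]
    return cd_samples, reward_samples

batch = [
    {"id": 0, "source": "vidgen"},
    {"id": 1, "source": "webvid"},
    {"id": 2, "source": "webvid"},
]
cd, rw = route_objectives(batch)
# cd holds all 3 samples; rw holds only the 2 WebVid samples.
```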
Hardware Specification | Yes | All our models are trained on 8 NVIDIA A100 GPUs for 8K gradient steps without gradient accumulation.
Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) are mentioned in the paper.
Experiment Setup | Yes | Settings. We distill our T2V-Turbo-v2 from VideoCrafter2 (Chen et al., 2024a). All our models are trained on 8 NVIDIA A100 GPUs for 8K gradient steps without gradient accumulation. We use a batch size of 3 to calculate the CD loss and 1 to optimize the reward objective on each GPU device. During optimization of the image-text reward model Rimg, we randomly sample 2 frames from each video by setting M = 2. The learning rate is set to 1e-5, and the guidance scale is defined within the range [ωmin, ωmax] = [5, 15]. We use DDIM (Song et al., 2020a) as our ODE solver Ψ, with a skipping-step parameter of k = 5. For motion guidance (MG), we set the motion guidance percentage τ = 0.5 and strength λ = 500.
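Two of these settings can be made concrete. The skipping-step parameter k = 5 means the teacher DDIM solver advances k timesteps per call rather than one, and the guidance scale ω is drawn from [5, 15] during training. A sketch of both, assuming the standard 1000 training timesteps (the paper does not restate this number here):

```python
import random

def skipped_timesteps(num_train_steps=1000, k=5):
    """Timesteps visited by a DDIM solver that jumps k steps at a time,
    descending from the final training timestep toward 0."""
    return list(range(num_train_steps - 1, -1, -k))

def sample_guidance_scale(w_min=5.0, w_max=15.0):
    """Guidance scale drawn uniformly from [w_min, w_max], matching the
    paper's [5, 15] range (the uniform draw is an assumption)."""
    return random.uniform(w_min, w_max)

schedule = skipped_timesteps()   # 200 timesteps: 999, 994, ..., 4
omega = sample_guidance_scale()
```

With k = 5 and 1000 training steps, the teacher traverses 200 solver steps instead of 1000, which is what makes distilling a few-step student tractable.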