T2V-Turbo-v2: Enhancing Video Model Post-Training through Data, Reward, and Conditional Guidance Design
Authors: Jiachen Li, Qian Long, Jian (Skyler) Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, William Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling. |
| Researcher Affiliation | Collaboration | Jiachen Li1, Qian Long2, Jian Zheng3, Xiaofeng Gao3, Robinson Piramuthu3, Wenhu Chen4, William Yang Wang1; 1UC Santa Barbara, 2UC Los Angeles, 3Amazon AGI, 4University of Waterloo |
| Pseudocode | Yes | A PSEUDO-CODES OF OUR T2V-TURBO-V2'S DATA PREPROCESSING AND TRAINING PIPELINE: Algorithm 1 and Algorithm 2 present the pseudo-codes for data preprocessing and training, respectively. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT: Our experiments are conducted with all open-sourced codes and training data. Our implementation codes have been included in the supplementary material and will be released to the public in a GitHub repository without breaking the double-blind rules. |
| Open Datasets | Yes | We experiment with VidGen-1M (Tan et al., 2024) (VG), OpenVid-1M (Nan et al., 2024) (OV), WebVid-10M (Bain et al., 2021) (WV), and their combinations. |
| Dataset Splits | No | We train on a mixed dataset VG + WV, which consists of equal portions of VidGen-1M (Tan et al., 2024) and WebVid-10M (Bain et al., 2021). While the CD loss is optimized across the entire dataset, the reward objective Eq. 10 is optimized using only WebVid data. To evaluate the 16-step generation of our method and T2V-Turbo, we carefully follow VBench's evaluation protocols by generating 5 videos for each prompt. The paper describes how data is used for different objectives but does not specify explicit train/test/validation splits for the datasets. |
| Hardware Specification | Yes | All our models are trained on 8 NVIDIA A100 GPUs for 8K gradient steps without gradient accumulation. |
| Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) are mentioned in the paper. |
| Experiment Setup | Yes | Settings. We distill our T2V-Turbo-v2 from VideoCrafter2 (Chen et al., 2024a). All our models are trained on 8 NVIDIA A100 GPUs for 8K gradient steps without gradient accumulation. We use a batch size of 3 to calculate the CD loss and 1 to optimize the reward objective on each GPU device. During optimization of the image-text reward model R_img, we randomly sample 2 frames from each video by setting M = 2. The learning rate is set to 1e-5, and the guidance scale is defined within the range [ω_min, ω_max] = [5, 15]. We use DDIM (Song et al., 2020a) as our ODE solver Ψ, with a skipping step parameter of k = 5. For motion guidance (MG), we set the motion guidance percentage τ = 0.5 and strength λ = 500. |
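The experiment-setup row above lists enough hyperparameters to express the configuration concretely. The sketch below collects those reported values into a config object, plus two helpers for the two sampled quantities (the guidance scale ω drawn from [ω_min, ω_max], and whether a solver step falls inside the motion-guidance fraction τ). All field and function names are my own inventions for illustration; only the numeric values come from the paper, and the reading of τ as "apply MG during the first τ fraction of steps" is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class TurboV2Config:
    """Training settings reported for T2V-Turbo-v2 (names are assumptions)."""
    teacher: str = "VideoCrafter2"
    num_gpus: int = 8                # NVIDIA A100
    grad_steps: int = 8_000          # no gradient accumulation
    cd_batch_per_gpu: int = 3        # batch size for the CD loss
    reward_batch_per_gpu: int = 1    # batch size for the reward objective
    frames_for_img_reward: int = 2   # M = 2 frames sampled per video for R_img
    lr: float = 1e-5
    omega_min: float = 5.0           # guidance-scale range [ω_min, ω_max]
    omega_max: float = 15.0
    ode_solver: str = "DDIM"
    skip_step_k: int = 5             # skipping-step parameter k
    mg_percentage: float = 0.5       # motion-guidance percentage τ
    mg_strength: float = 500.0       # motion-guidance strength λ


def sample_guidance_scale(cfg: TurboV2Config, rng: random.Random) -> float:
    """Draw a guidance scale ω uniformly from [ω_min, ω_max]."""
    return rng.uniform(cfg.omega_min, cfg.omega_max)


def use_motion_guidance(cfg: TurboV2Config, step: int, total_steps: int) -> bool:
    """Apply motion guidance only during the first τ fraction of solver steps
    (one plausible reading of the 'motion guidance percentage')."""
    return step < cfg.mg_percentage * total_steps
```

With τ = 0.5 and a 16-step DDIM schedule, this interpretation would apply motion guidance on steps 0 through 7 and run plain DDIM afterwards; the paper does not spell out the schedule, so treat this as one candidate implementation rather than the authors' exact procedure.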