T2V-Turbo-v2: Enhancing Video Model Post-Training through Data, Reward, and Conditional Guidance Design

Authors: Jiachen Li, Qian Long, Jian (Skyler) Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, William Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.
Researcher Affiliation | Collaboration | Jiachen Li (UC Santa Barbara), Qian Long (UC Los Angeles), Jian Zheng (Amazon AGI), Xiaofeng Gao (Amazon AGI), Robinson Piramuthu (Amazon AGI), Wenhu Chen (University of Waterloo), William Yang Wang (UC Santa Barbara)
Pseudocode | Yes | A. PSEUDO-CODES OF OUR T2V-TURBO-V2'S DATA PREPROCESSING AND TRAINING PIPELINE: Algorithm 1 and Algorithm 2 present the pseudo-codes for data preprocessing and training, respectively.
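The paper's Algorithms 1 and 2 are not reproduced in this report. As a rough illustration of the training step they describe, the following toy sketch combines a consistency-distillation (CD) loss with a reward objective, using the batch sizes from the Experiment Setup row (3 clips for the CD loss, 1 clip for the reward objective, M = 2 sampled frames for the image-text reward). All function bodies and names here are hypothetical stand-ins, not the paper's implementation.

```python
import random

# Hypothetical stand-ins for the paper's components; the real losses operate
# on latent video tensors, not scalars.
def consistency_distillation_loss(videos):
    # Toy CD loss: mean squared magnitude as a placeholder distance
    # between student and teacher predictions.
    return sum(v ** 2 for v in videos) / len(videos)

def image_text_reward(frame):
    # Toy image-text reward R_img evaluated on a single frame.
    return -abs(frame - 0.5)

def training_step(cd_batch, reward_video, m_frames=2):
    """One step: CD loss on a batch of 3 clips, reward objective on 1 clip.

    Per the paper's settings, M = 2 frames are sampled per video when
    optimizing against the image-text reward model.
    """
    loss_cd = consistency_distillation_loss(cd_batch)
    sampled = random.sample(reward_video, m_frames)
    # Maximizing reward = minimizing its negation.
    loss_reward = -sum(image_text_reward(f) for f in sampled) / m_frames
    return loss_cd + loss_reward

step_loss = training_step(cd_batch=[0.1, 0.2, 0.3],
                          reward_video=[0.4, 0.5, 0.6, 0.7])
```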
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: Our experiments are conducted with all open-sourced codes and training data. Our implementation codes have been included in the supplementary material and will be released to the public in a GitHub repository without breaking the double-blind rules.
Open Datasets | Yes | We experiment with VidGen-1M (Tan et al., 2024) (VG), OpenVid-1M (Nan et al., 2024) (OV), WebVid-10M (Bain et al., 2021) (WV), and their combinations.
Dataset Splits | No | We train on a mixed dataset VG + WV, which consists of equal portions of VidGen-1M (Tan et al., 2024) and WebVid-10M (Bain et al., 2021). While the CD loss is optimized across the entire dataset, the reward objective Eq. 10 is optimized using only WebVid data. To evaluate the 16-step generation of our method and T2V-Turbo, we carefully follow VBench's evaluation protocols by generating 5 videos for each prompt. The paper describes how data is used for different objectives but does not specify explicit train/test/validation splits for the datasets.
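The split described above routes objectives by data source rather than by a train/test partition: the CD loss sees the full VG + WV mixture, while the reward objective sees only WebVid samples. A minimal sketch of that routing, with the `source` tags and dict layout purely illustrative:

```python
def route_objectives(batch):
    """Split a mixed VG + WV batch by objective: the CD loss consumes every
    sample, while the reward objective consumes only WebVid samples.
    The "source" field and its values are illustrative assumptions."""
    cd_samples = list(batch)  # full mixed dataset
    reward_samples = [ex for ex in batch if ex["source"] == "webvid"]
    return cd_samples, reward_samples

batch = [
    {"id": 0, "source": "vidgen"},
    {"id": 1, "source": "webvid"},
    {"id": 2, "source": "webvid"},
]
cd, rw = route_objectives(batch)
# cd holds all 3 samples; rw holds only the 2 WebVid samples.
```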
Hardware Specification | Yes | All our models are trained on 8 NVIDIA A100 GPUs for 8K gradient steps without gradient accumulation.
Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) are mentioned in the paper.
Experiment Setup | Yes | Settings. We distill our T2V-Turbo-v2 from VideoCrafter2 (Chen et al., 2024a). All our models are trained on 8 NVIDIA A100 GPUs for 8K gradient steps without gradient accumulation. We use a batch size of 3 to calculate the CD loss and 1 to optimize the reward objective on each GPU device. During optimization of the image-text reward model Rimg, we randomly sample 2 frames from each video by setting M = 2. The learning rate is set to 1e-5, and the guidance scale is defined within the range [ωmin, ωmax] = [5, 15]. We use DDIM (Song et al., 2020a) as our ODE solver Ψ, with a skipping-step parameter of k = 5. For motion guidance (MG), we set the motion guidance percentage τ = 0.5 and strength λ = 500.
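Two of these settings can be made concrete. The skipping-step parameter k = 5 means the teacher DDIM solver advances k timesteps per call rather than one, and the guidance scale ω is drawn from [5, 15] during training. A sketch of both, assuming the standard 1000 training timesteps (the paper does not restate this number here):

```python
import random

def skipped_timesteps(num_train_steps=1000, k=5):
    """Timesteps visited by a DDIM solver that jumps k steps at a time,
    descending from the final training timestep toward 0."""
    return list(range(num_train_steps - 1, -1, -k))

def sample_guidance_scale(w_min=5.0, w_max=15.0):
    """Guidance scale drawn uniformly from [w_min, w_max], matching the
    paper's [5, 15] range (the uniform draw is an assumption)."""
    return random.uniform(w_min, w_max)

schedule = skipped_timesteps()   # 200 timesteps: 999, 994, ..., 4
omega = sample_guidance_scale()
```

With k = 5 and 1000 training steps, the teacher traverses 200 solver steps instead of 1000, which is what makes distilling a few-step student tractable.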