SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

Authors: Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Yingnian Wu, Lijuan Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that SLOWFAST-VGEN outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm also significantly enhances performance on long-horizon planning tasks. Project Website: https://slowfast-vgen.github.io
Researcher Affiliation | Collaboration | 1 UCLA, 2 Microsoft Research, 3 State University of New York at Buffalo
Pseudocode | Yes | Algorithm 1: SLOWFAST-VGEN Algorithm

(a) Fast Learning
Input: first frame of video sequence X_0, total generating iterations I, video diffusion model with frozen pretrained weights Φ and LoRA parameters Θ_0, fast learning rate α
Output: long video sequence Y
1: X_0 = VAE_ENCODE(X_0); Y ← X_0 {encode into latents}
2: for i in 0 to I−1 do
3:   // Generate the sequence in the current context window
4:   if i ≠ 0 then
5:     X_i ← Y_{i−1}
6:   end if
7:   C_i ← UserInput(i) {action conditioning acquired through user-interface input}
8:   Y_i = (Φ + Θ_i)(X_i, C_i)
9:   Y = Y ⊕ Y_i {concatenate the output latents to the final sequence latents}
10:  // Use the input and output in the context window to train TEMP-LORA
11:  X_i = X_i ⊕ Y_i {concatenate input and output to prepare TEMP-LORA training data}
12:  Sample noise N on the whole X_i sequence
13:  Calculate Loss on the whole X_i sequence
14:  Θ_{i+1} ← Θ_i − α ∇_Θ Loss
15: end for
16: Y = VAE_DECODE(Y) {decode into video}
17: return Y

(b) Slow-Fast Learning Loop
Definition: task-specific slow learning weights Φ, task-specific dataset D, LoRA parameters of all episodes Θ, slow learning rate β, slow-learning dataset D_s
// Slow Learning Loop
1: while not converged do
2:   D_s ← ∅ // prepare dataset for slow learning
3:   for each sample (x, episode) in D do
4:     // Fast Learning Loop
5:     Suppose the episode can be divided into I short sequences X^e_i for i in 0 to I−1
6:     Initialize TEMP-LORA parameters for this episode: Θ^e_0
7:     for i in 0 to I−1 do
8:       D_s = D_s ∪ {(X^e_i, X^e_{i+1}, Θ^e_i)} {X^e_{i+1} is the ground-truth output for input X^e_i}
9:       Fix Φ and update Θ^e_i using the fast learning algorithm
10:    end for
11:  end for
12:  // Use the D_s dataset for the slow learning update
13:  for (X^e_i, X^e_{i+1}, Θ^e_i) in D_s do
14:    Φ^e_i = Φ + Θ^e_i
15:    Calculate Loss based on the model output for input X^e_i and the ground-truth output X^e_{i+1}
16:    Fix Θ^e_i and update Φ only: Φ ← Φ − β ∇_Φ Loss
17:  end for
18: end while
Open Source Code | No | Project Website: https://slowfast-vgen.github.io. This is a project website, but the paper does not explicitly state that source code for the methodology is available there, nor does it provide a direct link to a code repository.
Open Datasets | No | To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Our slow learning dataset consists of 200k data points, each in the format (input video chunk, input free-text action, output video chunk). The dataset spans several domains:
Unreal. We utilize the Unreal Game Engine (Epic Games) for data collection...
Game. We manually play Minecraft, recording keyboard and mouse inputs and capturing videos.
Human Activities. We include the EPIC-KITCHENS (Damen et al., 2018; 2022) dataset...
Robot. We use several datasets from Open X-Embodiment (Collaboration et al., 2024), as well as tasks from Metaworld (Yu et al., 2021) and RLBench (James et al., 2019).
Driving. We utilize the HRI Driving Dataset (HDD) (Ramanishka et al., 2018).
While some components of the collected 200k video dataset are public and cited, the paper does not provide concrete access information for the aggregated dataset itself, which includes data the authors collected from Unreal and Minecraft.
Dataset Splits | No | We reserved a portion of our collected dataset as the test set, ensuring no scene overlap with the training set. The paper mentions a training/test split but does not provide specific percentages or sample counts.
Hardware Specification | Yes | We utilize approximately 64 V100 GPUs for the pre-training of SLOWFAST-VGEN, with a batch size of 128. The slow learning rate is set to 5e-6, while the fast learning rate is 1e-4. Training videos of mixed lengths are used, all within the context window of 32 frames. During training, we freeze the VAE and CLIP Encoder, allowing only the UNet to be trained. For inference and fast learning, we employ a single V100 GPU.
Software Dependencies | No | The paper mentions using components like ModelScope T2V, a CLIP encoder, and UNet architectures, and tools such as PySceneDetect, but it does not specify version numbers for key software dependencies like Python, PyTorch, CUDA, or other libraries used in the implementation.
Experiment Setup | Yes | We utilize approximately 64 V100 GPUs for the pre-training of SLOWFAST-VGEN, with a batch size of 128. The slow learning rate is set to 5e-6, while the fast learning rate is 1e-4. Training videos of mixed lengths are used, all within the context window of 32 frames. During training, we freeze the VAE and CLIP Encoder, allowing only the UNet to be trained. For inference and fast learning, we employ a single V100 GPU. For TEMP-LORA, a LoRA rank of 32 is used, and the Adam optimizer is employed in both learning phases.
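To give a sense of what the reported LoRA rank of 32 implies, the sketch below counts the trainable parameters a rank-32 adapter adds to a single weight matrix. The layer size is a hypothetical UNet attention projection, not a figure from the paper; LoRA in general factors the frozen matrix's update into two low-rank factors A (d_in x r) and B (r x d_out).

```python
# Back-of-the-envelope parameter count for a rank-32 LoRA adapter on one
# weight matrix. Layer dimensions are illustrative, not from the paper.

def lora_param_count(d_in, d_out, rank):
    # LoRA trains two low-rank factors A (d_in x rank) and B (rank x d_out)
    # while the full d_in x d_out matrix W stays frozen.
    return rank * (d_in + d_out)

d_in = d_out = 1024                  # hypothetical attention projection size
full = d_in * d_out                  # params if W itself were trainable
lora = lora_param_count(d_in, d_out, rank=32)
ratio = lora / full
print(full, lora, ratio)             # rank 32 trains ~6% of this matrix
```

This cheapness is what makes the per-episode TEMP-LoRA updates in fast learning affordable on a single V100, while the full UNet (Φ) is only updated in the 64-GPU slow-learning phase.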