Diffusion Adversarial Post-Training for One-Step Video Generation

Authors: Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our experiments show that our adversarial post-trained model can generate two-second, 1280×720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
Researcher Affiliation | Industry | ByteDance Seed. Correspondence to: Shanchuan Lin <EMAIL>.
Pseudocode | No | The paper describes the methodology in prose and mathematical equations (e.g., Equations 1, 2, 8, 9, and 10) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states "Our project page: https://seaweed-apt.com/." A project page may contain demonstrations or general information, but this is not an explicit statement of code release for the methodology described in the paper, nor a direct link to a code repository.
Open Datasets | Yes | Additionally, following the previous works (Lin et al., 2024; Sauer et al., 2024; 2025), we also report the FID (Heusel et al., 2017), PFID (Lin et al., 2024), and CLIP (Radford et al., 2021) metrics on the COCO dataset (Lin et al., 2014).
Dataset Splits | No | For image evaluation, we follow the evaluation protocol in previous diffusion distillation works (Sauer et al., 2024; 2025) and generate samples using 300 randomly selected prompts from PartiPrompts (Yu et al., 2022a) and DrawBench (Saharia et al., 2022). We generate 3 images per prompt. For video evaluation, we generate one video for each of 96 custom prompts.
Hardware Specification | Yes | On an H100 GPU, our model can generate a two-second 1280×720 24fps video latent using a single step in two seconds. On 8 H100 GPUs with parallelization, the entire pipeline with text encoder and latent decoder runs in real time. ... We use 128–256 H100 GPUs with a batch size of 9062. ... We use 1024 H100 GPUs with gradient accumulation to reach a batch size of 2048.
Software Dependencies | No | However, PyTorch FSDP (Zhao et al., 2023), gradient checkpointing (Chen et al., 2016), FlashAttention (Dao et al., 2022; Dao, 2023; Shah et al., 2024), and other fused operators (NVIDIA Apex) do not support higher-order gradient computation or double backward at the time of writing, preventing the use of R1 in large-scale transformer models. ... We use each model's default CFG (Ho & Salimans, 2021) as configured in diffusers (von Platen et al., 2022)...
Experiment Setup | Yes | We first train the model on only 1024px images. We use 128–256 H100 GPUs with a batch size of 9062. The learning rate is 5e-6 for both the generator and the discriminator. We find the model adapts quickly. We use an Exponential Moving Average (EMA) decay rate of 0.995 and adopt the EMA checkpoint after 350 updates on the generator before the quality starts to degrade. ... We use 1024 H100 GPUs with gradient accumulation to reach a batch size of 2048. We lower the learning rate to 3e-6 for stability and train it for 300 updates. ... The RMSProp optimizer is used with α = 0.9, which is equivalent to Adam (Kingma & Ba, 2014) with β1 = 0, β2 = 0.9, with reduced memory consumption. We do not use weight decay or gradient clipping. The entire training is conducted in BF16 mixed precision.
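The software-dependencies row hinges on why R1 regularization needs "double backward": the penalty is taken on the norm of the discriminator's input gradient, so updating the discriminator's parameters means differentiating through a gradient. Below is a pure-Python toy (a hypothetical 1-D discriminator D(x) = a·x², not the paper's model) that makes the two differentiation levels explicit; in an autodiff framework the second level is what requires `create_graph=True`, which FSDP and gradient checkpointing did not support at the time.

```python
# Toy 1-D discriminator D(x) = a * x**2 with a single parameter a.
# R1 penalizes the squared norm of dD/dx at real samples:
#   R1(a, x) = (gamma / 2) * (dD/dx)**2 = (gamma / 2) * (2*a*x)**2
# Training on R1 needs dR1/da, i.e. a derivative of a derivative.

def d_disc_dx(a, x):
    # First backward: gradient of D with respect to its input x.
    return 2.0 * a * x

def r1_penalty(a, x, gamma=1.0):
    # The R1 regularizer built from that input gradient.
    g = d_disc_dx(a, x)
    return 0.5 * gamma * g * g

def d_r1_da(a, x, gamma=1.0):
    # Second backward: derivative of the penalty w.r.t. the parameter a,
    # written analytically here (R1 = 2*gamma*a**2*x**2, so dR1/da =
    # 4*gamma*a*x**2). An autodiff framework must retain the graph of
    # the first backward to compute this step.
    return 4.0 * gamma * a * x * x
```

A finite-difference check on `r1_penalty` confirms `d_r1_da`; this is only a sketch of the R1 mechanism (Mescheder et al.), not the approximation the paper uses to avoid it.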
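The experiment-setup row's claim that RMSProp with α = 0.9 is equivalent to Adam with β1 = 0, β2 = 0.9 can be seen from the update rules: with β1 = 0, Adam's first-moment buffer collapses to the raw gradient, leaving only the second-moment EMA that RMSProp already tracks, so one buffer can be dropped. A minimal pure-Python sketch for a single scalar weight (function names hypothetical; Adam's bias correction is omitted for clarity):

```python
import math

def rmsprop_step(w, g, v, lr=3e-6, alpha=0.9, eps=1e-8):
    # RMSProp: EMA of squared gradients, one state buffer v.
    v = alpha * v + (1 - alpha) * g * g
    return w - lr * g / (math.sqrt(v) + eps), v

def adam_step_beta1_zero(w, g, v, lr=3e-6, beta2=0.9, eps=1e-8):
    # Adam with beta1 = 0 and bias correction dropped: the momentum
    # buffer m = beta1*m + (1-beta1)*g reduces to g itself, so only
    # the second-moment buffer v remains -- identical to RMSProp.
    v = beta2 * v + (1 - beta2) * g * g
    return w - lr * g / (math.sqrt(v) + eps), v
```

Both functions produce bit-identical updates, which is the memory argument in the quote: the optimizer state per parameter is one buffer instead of two.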