Fast Video Generation with Sliding Tile Attention

Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, Hao Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments We evaluate STA on HunyuanVideo, a state-of-the-art open video DiT comparable to many proprietary ones. We generate Hunyuan outputs with 117 frames at a 1280 × 768 resolution. After VAE compression and tokenization, this corresponds to a latent video of shape (30, 48, 80). Beyond video, we also apply STA to the leading image diffusion model, FLUX (Black-Forest, 2023), to demonstrate its effectiveness in 2D. We evaluate both efficiency and video quality. The STA kernel's efficiency is measured using standard metrics such as MFU and latency, as detailed in Section 4.1. For end-to-end speedup on the DiT, we report measured wall-clock latency, excluding time spent on the VAE and text encoder. For generated video quality, we find existing automated metrics are often unreliable. Following Polyak et al. (2024), we emphasize human evaluation and present the results in Section 4.2. For completeness, we also report automated metrics, including VBench (Huang et al., 2024), SSIM, PSNR, and CD-FVD (Ge et al., 2024).
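The MFU and latency metrics referenced above relate through a standard accounting identity; a minimal sketch, using the usual dense-attention FLOP count (two N×N×d matmuls per head) and treating the head configuration and peak-FLOPs figure as assumptions not stated in the excerpt:

```python
def attention_flops(seq_len: int, head_dim: int, num_heads: int) -> float:
    """FLOPs for one dense attention layer's score and value matmuls:
    QK^T and PV each cost 2 * N^2 * d multiply-adds per head (forward only)."""
    return 4.0 * seq_len**2 * head_dim * num_heads

def mfu(achieved_flops: float, latency_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: achieved throughput divided by hardware peak."""
    return achieved_flops / (latency_s * peak_flops_per_s)

# The (30, 48, 80) latent video above flattens to 115,200 tokens, which is
# why dense attention dominates the cost at this resolution.
n_tokens = 30 * 48 * 80
flops = attention_flops(n_tokens, head_dim=128, num_heads=24)  # head config assumed
```

The quadratic `seq_len**2` term is what makes the 115,200-token sequence so expensive and motivates the sparse attention pattern.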
Researcher Affiliation | Academia | 1University of California, San Diego; 2University of Michigan, Ann Arbor; 3Tsinghua University; 4University of California, Berkeley; 5Mohamed bin Zayed University of Artificial Intelligence. Correspondence to: Hao Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1 STA Mask Search
Input: Transformer model M, total steps T, mask pattern list P, keep first T0 timesteps full attention
Output: Dictionary dict that stores the selected mask pattern for each head
Initialize dict
for t = T0 + 1 to T do
    for each layer-head combination (l, h) in M do
        O ← attention output of original (l, h)
        Initialize minimum loss ← ∞
        Initialize best pattern ← null
        for each p in P do
            Mask head h of layer l using mask pattern p
            O′ ← attention output of M after masking
            loss ← MSE(O, O′)
            if loss < minimum loss then
                minimum loss ← loss
                best pattern ← p
        Record best pattern for (t, l, h) in dict
return dict
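The mask search above can be sketched as a small greedy loop in Python. The `attn_output` callable and pattern objects are illustrative stand-ins for the model's actual head-level attention hook, not names from the paper's code:

```python
def sta_mask_search(attn_output, heads, patterns, total_steps, t0):
    """Greedy per-head mask selection (a sketch of Algorithm 1).

    attn_output(t, layer, head, pattern) -> list[float]: attention output
    at timestep t, where pattern=None means full (unmasked) attention.
    heads: iterable of (layer, head) pairs; patterns: candidate masks.
    The first t0 timesteps keep full attention and are skipped.
    """
    selected = {}
    for t in range(t0 + 1, total_steps + 1):
        for (l, h) in heads:
            ref = attn_output(t, l, h, None)  # full-attention reference output
            best_loss, best_pattern = float("inf"), None
            for p in patterns:
                out = attn_output(t, l, h, p)  # output with head (l, h) masked by p
                # mean squared error against the full-attention reference
                loss = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
                if loss < best_loss:
                    best_loss, best_pattern = loss, p
            selected[(t, l, h)] = best_pattern
    return selected
```

For each (timestep, layer, head) triple, the search keeps the candidate pattern whose masked output is closest to full attention in MSE.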
Open Source Code | Yes | We make our codebase public at https://github.com/hao-ai-lab/FastVideo.
Open Datasets | Yes | We evaluate STA on HunyuanVideo, a state-of-the-art open video DiT comparable to many proprietary ones. We generate Hunyuan outputs with 117 frames at a 1280 × 768 resolution. After VAE compression and tokenization, this corresponds to a latent video of shape (30, 48, 80). Beyond video, we also apply STA to the leading image diffusion model, FLUX (Black-Forest, 2023), to demonstrate its effectiveness in 2D. (...) Following Polyak et al. (2024), we emphasize human evaluation and present the results in Section 4.2. For completeness, we also report automated metrics, including VBench (Huang et al., 2024), SSIM, PSNR, and CD-FVD (Ge et al., 2024). We provide an example in Figure 6, with additional qualitative results available in Appendix Section G.
Dataset Splits | No | The paper discusses using datasets such as "HunyuanVideo", "Movie Gen Bench", "VBench", and the "COCO-2014 validation dataset" for evaluation, and "2,000 synthetically generated videos" for finetuning. However, it does not explicitly provide train/test/validation split percentages, counts, or a splitting methodology, beyond using existing validation sets or sampling for evaluation.
Hardware Specification | Yes | Even with FlashAttention-3 (FA3) (Shah et al., 2024) and a high-end H100 GPU, HunyuanVideo (HunyuanVideo Team, 2025) still requires 16 minutes to generate a 5-second 720p video. This bottleneck severely limits the practical deployment of DiTs.
Software Dependencies | No | STA can be efficiently implemented with FlexAttention, which provides enough functionality to skip all empty blocks and avoid adding unnecessary intra-block masks on the dense blocks. We can further optimize the sparse attention masks by disaggregating the inter-block mask logic from the compute kernels. Thus, we implement our attention kernels based on ThunderKittens (Spector et al., 2024) and FlashAttention-3 (Shah et al., 2024). Our implementation splits the threadblock into compute warpgroups and data warpgroups, and the inter-block mask is completely managed by the data warpgroups. Each compute warpgroup is responsible for calculating one query block, which always resides in SRAM (Split-Q (Dao, 2024)). The data warpgroup is responsible for asynchronously loading the KV blocks from HBM to SRAM. For each query block, the data warpgroup decides which key and value blocks that query block will attend to under STA and loads only those blocks. Since the data warpgroups are asynchronous, the overhead of computing the inter-block mask in STA and deciding which data to load can be hidden through overlapping. The compute worker, on the other hand, is completely oblivious to the sparse attention pattern: it performs attention computation with the key and value blocks loaded into shared memory by the data workers, and once all data in the circular cache is consumed, the computation is finished. While these tools are mentioned, specific version numbers for them or other common software libraries are not provided.
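The inter-block mask decision the data warpgroups evaluate can be illustrated in plain Python: for each query tile, a sliding window of KV tiles is centered on it and clamped to the tile grid, so every query tile loads the same number of fully dense KV tiles. The tile and window sizes below are illustrative assumptions, not the kernel's actual configuration:

```python
from itertools import product

def sta_kv_tiles(q_tile, grid_tiles, window_tiles):
    """KV tiles a query tile attends to under a sliding tile window.

    q_tile:       (t, h, w) coordinates of the query tile.
    grid_tiles:   number of tiles along each axis of the latent video.
    window_tiles: window extent in tiles per axis; the window is centered
                  on the query tile and shifted inward at the boundaries
                  so it always stays fully inside the grid.
    """
    axis_ranges = []
    for q, g, w in zip(q_tile, grid_tiles, window_tiles):
        lo = min(max(q - w // 2, 0), g - w)  # clamp the window to the grid
        axis_ranges.append(range(lo, lo + w))
    return list(product(*axis_ranges))

# E.g., a (30, 48, 80) latent split into hypothetical (6, 8, 8) tiles gives a
# (5, 6, 10) tile grid; a corner query tile still gets a full 3x3x3 window.
kv = sta_kv_tiles((0, 0, 0), grid_tiles=(5, 6, 10), window_tiles=(3, 3, 3))
```

Because the window is clamped rather than truncated at the boundary, every loaded KV block is fully attended, which is what lets the compute warpgroups stay oblivious to the mask.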
Experiment Setup | Yes | We train on 2,000 synthetically generated videos from HunyuanVideo at a resolution of 1280 × 768 with 117 frames. The prompts are sourced from the Mixkit dataset (Lin et al., 2024). To reduce memory usage and accelerate training, we precompute VAE-encoded latents and text encoder states. Training involves fine-tuning for 1,600 steps with a batch size of 2 and a learning rate of 2e-5. We optimize using the loss function from Eq. (5) with coefficients α = 1, β = 0.5, and γ = 0.5. To prevent overfitting on a single guidance scale, we alternate between guidance scales of 1 and 6 at odd and even steps. The entire process runs on 8 H100 GPUs with FSDP and context parallelism for training (8 hours) and sequence parallelism for inference.
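The guidance-scale alternation described above reduces to a per-step toggle. Which parity gets which scale is an assumption here, since the excerpt only states that scales 1 and 6 alternate at odd and even steps:

```python
def guidance_scale(step: int) -> float:
    """Alternate guidance between 1 and 6 so training does not overfit to
    a single scale. Mapping odd steps to scale 1 is an assumption; the
    excerpt does not specify which parity takes which value."""
    return 1.0 if step % 2 == 1 else 6.0
```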