Latte: Latent Diffusion Transformer for Video Generation

Authors: Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we conduct a comprehensive ablation analysis, encompassing video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our analysis identifies best practices that enable Latte to generate photorealistic videos with temporally coherent content (see Fig. 1) and achieve state-of-the-art performance across four standard video generation benchmarks.
Researcher Affiliation Collaboration 1Department of Data Science & AI, Faculty of Information Technology, Monash University 2Shanghai AI Laboratory 3Nanjing University of Posts and Telecommunications 4S-Lab, Nanyang Technological University
Pseudocode No The paper includes architectural diagrams (Figure 2 and Figure 10) that illustrate the structure of the Transformer blocks and S-Ada LN, but it does not present any formal pseudocode or algorithm blocks describing the procedures or methods in a step-by-step, code-like format.
Open Source Code No The project page is available at https://maxin-cn.github.io/latte_project/. This is a project page that provides an overview of the project and samples, but it does not explicitly state that the source code for the methodology described in the paper is provided at this link, nor is it a direct link to a code repository.
Open Datasets Yes We primarily conduct comprehensive experiments on four public datasets: FaceForensics Rössler et al. (2018), SkyTimelapse Xiong et al. (2018), UCF101 Soomro et al. (2012), and Taichi-HD Siarohin et al. (2019).
Dataset Splits No Following the experimental setup in Skorokhodov et al. (2022), except for UCF101, we use the training split for all datasets if they are available. For UCF101, we use both training and testing splits. We extract 16-frame video clips from these datasets using a specific sampling interval, with each frame resized to 256×256 resolution for training.
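The clip-extraction step quoted above (16-frame clips taken at a fixed sampling interval) can be sketched as index arithmetic. This is a minimal illustration, not the authors' code; the report does not state the interval, so `interval=3` here is a purely hypothetical value.

```python
import random

def sample_clip_indices(num_frames, clip_len=16, interval=3):
    """Pick clip_len frame indices spaced `interval` frames apart,
    starting at a random valid offset within the video.

    `interval=3` is an assumed default; the paper only says
    "a specific sampling interval" without giving the value.
    """
    span = (clip_len - 1) * interval + 1  # frames covered by one clip
    if num_frames < span:
        raise ValueError("video shorter than the required clip span")
    start = random.randrange(num_frames - span + 1)
    return [start + i * interval for i in range(clip_len)]
```

Each selected frame would then be resized to 256×256 before being fed to the model, as described above.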
Hardware Specification Yes All experiments are conducted on 8 NVIDIA A100 (80G) GPUs.
Software Dependencies Yes We borrow the pre-trained variational autoencoder from Stable Diffusion 1.4.
Experiment Setup Yes We use the AdamW optimizer with a constant learning rate of 1×10⁻⁴ to train all models. Horizontal flipping is the only employed data augmentation. Following common practices within generative modeling works Peebles & Xie (2023); Bao et al. (2023), an exponential moving average (EMA) of Latte weights is upheld throughout training, employing a decay rate of 0.9999. A series of N Transformer blocks are used to construct our Latte model, and the hidden dimension of each Transformer block is D, with multi-head attention. Following ViT, we identify four configurations of Latte with different numbers of parameters as shown in Tab. 2.
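The EMA of model weights mentioned above (decay 0.9999) is a standard update rule; a minimal, framework-free sketch of that rule, using plain dictionaries of parameters rather than the authors' actual training loop:

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place exponential moving average of parameters:
    ema <- decay * ema + (1 - decay) * current.

    `ema_params` and `params` are dicts mapping parameter names to
    values; in a real training loop these would be model tensors,
    and this update would run after every optimizer step.
    """
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
```

With decay 0.9999, the EMA weights track a slowly moving average of the training weights, and it is these averaged weights that are typically used for sampling.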