Latte: Latent Diffusion Transformer for Video Generation

Authors: Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we conduct a comprehensive ablation analysis, encompassing video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our analysis identifies best practices that enable Latte to generate photorealistic videos with temporally coherent content (see Fig. 1) and achieve state-of-the-art performance across four standard video generation benchmarks.
Researcher Affiliation Collaboration 1Department of Data Science & AI, Faculty of Information Technology, Monash University 2Shanghai AI Laboratory 3Nanjing University of Posts and Telecommunications 4S-Lab, Nanyang Technological University
Pseudocode No The paper includes architectural diagrams (Figure 2 and Figure 10) that illustrate the structure of the Transformer blocks and S-Ada LN, but it does not present any formal pseudocode or algorithm blocks describing the procedures or methods in a step-by-step, code-like format.
Open Source Code No The project page is available at https://maxin-cn.github.io/latte_project/. This is a project page that provides an overview of the project and samples, but it does not explicitly state that the source code for the methodology described in the paper is provided at this link, nor is it a direct link to a code repository.
Open Datasets Yes We primarily conduct comprehensive experiments on four public datasets: FaceForensics Rössler et al. (2018), SkyTimelapse Xiong et al. (2018), UCF101 Soomro et al. (2012), and Taichi-HD Siarohin et al. (2019).
Dataset Splits No Following the experimental setup in Skorokhodov et al. (2022), except for UCF101, we use the training split for all datasets if they are available. For UCF101, we use both training and testing splits. We extract 16-frame video clips from these datasets using a specific sampling interval, with each frame resized to 256×256 resolution for training.
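The clip-extraction step quoted above (16-frame clips taken at a fixed sampling interval) can be sketched as index arithmetic. This is a minimal illustration, not the authors' code; the report does not state the interval, so `interval=3` here is a purely hypothetical value.

```python
import random

def sample_clip_indices(num_frames, clip_len=16, interval=3):
    """Pick clip_len frame indices spaced `interval` frames apart,
    starting at a random valid offset within the video.

    `interval=3` is an assumed default; the paper only says
    "a specific sampling interval" without giving the value.
    """
    span = (clip_len - 1) * interval + 1  # frames covered by one clip
    if num_frames < span:
        raise ValueError("video shorter than the required clip span")
    start = random.randrange(num_frames - span + 1)
    return [start + i * interval for i in range(clip_len)]
```

Each selected frame would then be resized to 256×256 before being fed to the model, as described above.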
Hardware Specification Yes All experiments are conducted on 8 NVIDIA A100 (80G) GPUs.
Software Dependencies Yes We borrow the pre-trained variational autoencoder from Stable Diffusion 1.4.
Experiment Setup Yes We use the AdamW optimizer with a constant learning rate of 1×10⁻⁴ to train all models. Horizontal flipping is the only employed data augmentation. Following common practices within generative modeling works Peebles & Xie (2023); Bao et al. (2023), an exponential moving average (EMA) of Latte weights is upheld throughout training, employing a decay rate of 0.9999. A series of N Transformer blocks are used to construct our Latte model, and the hidden dimension of each Transformer block is D, with multi-head attention. Following ViT, we identify four configurations of Latte with different numbers of parameters as shown in Tab. 2.
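The EMA of model weights mentioned above (decay 0.9999) is a standard update rule; a minimal, framework-free sketch of that rule, using plain dictionaries of parameters rather than the authors' actual training loop:

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place exponential moving average of parameters:
    ema <- decay * ema + (1 - decay) * current.

    `ema_params` and `params` are dicts mapping parameter names to
    values; in a real training loop these would be model tensors,
    and this update would run after every optimizer step.
    """
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
```

With decay 0.9999, the EMA weights track a slowly moving average of the training weights, and it is these averaged weights that are typically used for sampling.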