MarDini: Masked Auto-regressive Diffusion for Video Generation at Scale

Authors: Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan Camilo Perez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, Juan-Manuel Perez-Rua

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical study on MarDini highlights the following key characteristics: Flexibility. Scalability. Efficiency. ... We evaluate MarDini on two benchmarks: VIDIM-Bench (Jain et al., 2024) for long-term video interpolation, and VBench (Huang et al., 2024) for image-to-video generation. We further elaborate on the specifics of these benchmarks in Appendix D. ... 3.1 Ablation Studies and Analysis
Researcher Affiliation | Collaboration | Haozhe Liu (1,2), Shikun Liu (2), Zijian Zhou (2), Mengmeng Xu (2), Yanping Xie (2), Xiao Han (2), Juan C. Pérez (2), Ding Liu (2), Kumara Kahatapitiya (2), Menglin Jia (2), Jui-Chieh Wu (2), Sen He (2), Tao Xiang (2), Jürgen Schmidhuber (1), Juan-Manuel Perez-Rua (2); 1 KAUST, 2 Meta AI; Equal Contribution; Correspondence: EMAIL; EMAIL
Pseudocode | No | The paper describes its architecture and training pipeline through textual descriptions and figures (Figures 1 and 2 illustrate design details), but it does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 2.1 'Design Overview' and Section 2.4 'MarDini Training Recipes' describe the processes, but not in a structured pseudocode format.
Open Source Code | No | We ensure reproducibility by providing detailed model configurations in Appendix A, along with the complete training recipes outlined in Appendix B. However, due to organizational policies, the model was trained using internal infrastructure and proprietary dependencies that cannot be publicly released. Additionally, the VAE component is based on internal product data and was not developed within this project, further restricting the potential release of model weights.
Open Datasets | Yes | We evaluate MarDini on two benchmarks: VIDIM-Bench (Jain et al., 2024) for long-term video interpolation, and VBench (Huang et al., 2024) for image-to-video generation. ... For VIDIM-Bench, the task involves generating seven intermediate frames... The dataset includes approximately 400 videos from both DAVIS (Pont-Tuset et al., 2017) and UCF-101 (Soomro et al., 2012). ... The data is sourced from public research data (Downs et al., 2022). ... The video data used for visualization is sourced from public research data (Nan et al., 2024).
Dataset Splits | No | The paper mentions using specific datasets such as VIDIM-Bench, VBench, DAVIS, and UCF-101 for evaluation, and a licensed Shutterstock dataset for training. For VIDIM-Bench, it describes the evaluation task (generating 7 intermediate frames) and metrics; for VBench, it states 'we utilize the official dataset to assess the model'. However, it does not explicitly provide training/test/validation split percentages, sample counts, or a detailed splitting methodology for its own training or evaluation setup beyond the standard evaluation protocol of each benchmark.
Hardware Specification | Yes | All experiments, including model variations, ablation studies, benchmark evaluations, and full model training, are carried out on a distributed MAST scheduler (Choudhury et al., 2024) using 256 H100 GPUs. ... Table 2: Efficiency of MarDini's generations with and without the asymmetric design. Both latency and GPU memory are measured as the average over video generations using DDIM with 25 steps on a single A100 GPU, with bf16 mixed precision. ... All models were trained under identical configurations, including a batch size of 512, video clips of 4 frames, a resolution of 256, and a mask ratio ranging from 0.65 to 1.00. Training was conducted on 16 H100 GPUs and evaluated at the same selected training steps in each training configuration.
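The latency figures quoted above are averages over repeated generations. Since the paper's code is not released, the sketch below is only a generic, library-agnostic illustration of such a measurement protocol; the helper name, warm-up count, and run count are assumptions, not from the paper.

```python
import time

def average_latency(generate_fn, n_runs=5, n_warmup=1):
    """Average wall-clock latency of a generation call.

    Hypothetical helper: the paper only states that latency is
    averaged over DDIM 25-step generations on a single A100.
    """
    for _ in range(n_warmup):  # warm-up runs, excluded from timing
        generate_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    return (time.perf_counter() - start) / n_runs

# Usage with a cheap stand-in for the actual video-generation call:
latency = average_latency(lambda: sum(range(10_000)))
```

In practice the stand-in lambda would be replaced by the model's sampling call; GPU memory would be tracked separately with the framework's own profiling tools.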
Software Dependencies | No | The paper mentions using the AdamW optimizer and FSDP (a PyTorch feature) for training, and DDIM for inference. It also states 'All implementations are based on their official code'. However, it does not provide specific version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software components used in the implementation.
Experiment Setup | Yes | In Figure 6, we present the training details of MarDini. All experiments... are carried out on a distributed MAST scheduler (Choudhury et al., 2024) using 256 H100 GPUs. The training dataset comprises approximately 34 million filtered Shutterstock videos, segmented into 2-second training clips. We use the AdamW optimizer for each stage with a 1.4 × 10^-4 learning rate and a cosine learning-rate scheduler. We adapt our batch size based on the resolution and the frame count to maximize GPU utilization. For example, at 256 × 256 resolution with 9 frames, the batch size is 1024, processing ~9K frames per iteration; at 512 × 512 resolution with 9 frames, the batch size is 720, processing 6480 frames per iteration. During inference, we set the classifier-free guidance (CFG) (Ho & Salimans, 2022) scale to 2.5 for the image-to-video task with the noise solver DDIM (Song et al., 2021), and we directly remove classifier-free guidance for video interpolation as it is redundant. FSDP (Zhao et al., 2023) and activation checkpointing (Zhao et al., 2023) are enabled to further save GPU memory.
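The reported optimizer settings can be sanity-checked with a short sketch. The cosine schedule below is a generic re-implementation: the paper names the scheduler and the 1.4 × 10^-4 base rate but gives no formula, so the exact decay shape, floor, and absence of warmup are assumptions. The frames-per-iteration arithmetic reproduces the figures quoted above.

```python
import math

BASE_LR = 1.4e-4  # learning rate stated in the paper

def cosine_lr(step, total_steps, base_lr=BASE_LR, min_lr=0.0):
    # Generic cosine decay; min_lr and the lack of warmup are assumptions.
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Frames-per-iteration arithmetic from the reported configurations:
frames_256 = 1024 * 9  # 9216, i.e. ~9K frames per iteration at 256 x 256
frames_512 = 720 * 9   # 6480 frames per iteration at 512 x 512
```

The schedule starts at the base rate and decays smoothly to the floor by the final step, matching the usual behaviour of off-the-shelf cosine schedulers.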