Autoregressive Video Generation without Vector Quantization

Authors: Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, Xinlong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4, "EXPERIMENT": 4.1 Experiment Setup; 4.2 Main Results; 4.3 Qualitative Results; 4.4 Ablation Study.
Researcher Affiliation | Collaboration | (1) Beijing University of Posts and Telecommunications; (2) Key Laboratory of Intelligent Information Processing, ICT, CAS; (3) University of Chinese Academy of Sciences; (4) Dalian University of Technology; (5) Beijing Academy of Artificial Intelligence.
Pseudocode | No | The paper describes the methodology in Section 3 using textual descriptions, mathematical formulations (e.g., Equations 1-4), and illustrative figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are publicly available at https://github.com/baaivision/NOVA.
Open Datasets | Yes | "For text-to-image training, we initially curate 16M image-text pairs sourced from DataComp (Gadre et al. (2024)), COYO (Byeon et al. (2022)), Unsplash (Unsplash Team (2020)), and JourneyDB (Sun et al. (2024a)). To explore the scaling properties of NOVA, we expanded the dataset to approximately 600M image-text pairs by selecting more images that have a minimum aesthetic score of 5.0 from LAION (Schuhmann et al. (2022)), DataComp, and COYO. For text-to-video training, we select 19M video-text pairs on a subset (Lin et al. (2024)) of Panda-70M (Chen et al. (2024b)) and internal video-text pairs. We further collect 1M high-resolution video-text pairs from Pexels (Pexels Team (2014)) to fine-tune our final video generation model."
Dataset Splits | No | The paper lists the large datasets used for training (e.g., 16M image-text pairs from DataComp (Gadre et al. (2024)), COYO (Byeon et al. (2022)), Unsplash (Unsplash Team (2020)), and JourneyDB (Sun et al. (2024a)); approximately 600M image-text pairs including LAION (Schuhmann et al. (2022)); 19M video-text pairs on a subset (Lin et al. (2024)) of Panda-70M (Chen et al. (2024b))). However, it does not specify training/validation/test splits for its own model training and evaluation, instead relying on external benchmarks to evaluate generated samples.
Hardware Specification | Yes | "Training details. NOVA is trained with sixteen A100 (40G) nodes."
Software Dependencies | No | The paper mentions using OpenCV (cv2) for optical flow and a pre-trained language model, but does not provide specific version numbers for these or for other key software libraries such as PyTorch or Python, which are essential for reproducibility.
Experiment Setup | Yes | "Training details. NOVA is trained with sixteen A100 (40G) nodes. We utilize the AdamW optimizer (Loshchilov et al. (2017)) (β1 = 0.9, β2 = 0.95) with a weight decay of 0.02 and a base learning rate of 1e-4 in all experiments. The peak learning rate is adjusted for different batch sizes during training using the scaling rule (Goyal (2017)): lr = base_lr × batchsize / 256. We train text-to-image models from scratch and then load these weights to train text-to-video models."
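The linear scaling rule quoted above can be expressed as a short function. This is an illustrative sketch, not code from the NOVA repository; the function name and the example batch size are assumptions.

```python
def peak_lr(base_lr: float, batch_size: int, reference_batch: int = 256) -> float:
    """Linear scaling rule (Goyal et al., 2017): scale the base learning
    rate proportionally to the global batch size, relative to a reference
    batch size of 256."""
    return base_lr * batch_size / reference_batch

# Hypothetical example: base lr 1e-4 (as in the paper) with a global batch of 1024
print(peak_lr(1e-4, 1024))  # → 0.0004
```

With the paper's base learning rate of 1e-4, any batch size other than 256 simply rescales the peak rate by batchsize/256.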