Autoregressive Video Generation without Vector Quantization

Authors: Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, Xinlong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4, "EXPERIMENT": 4.1 Experiment Setup; 4.2 Main Results; 4.3 Qualitative Results; 4.4 Ablation Study.
Researcher Affiliation | Collaboration | (1) Beijing University of Posts and Telecommunications; (2) Key Laboratory of Intelligent Information Processing, ICT, CAS; (3) University of Chinese Academy of Sciences; (4) Dalian University of Technology; (5) Beijing Academy of Artificial Intelligence.
Pseudocode | No | The paper describes the methodology in Section 3 using textual descriptions, mathematical formulations (e.g., Equations 1-4), and illustrative figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are publicly available at https://github.com/baaivision/NOVA.
Open Datasets | Yes | "For text-to-image training, we initially curate 16M image-text pairs sourced from DataComp (Gadre et al. (2024)), COYO (Byeon et al. (2022)), Unsplash (Unsplash Team (2020)), and JourneyDB (Sun et al. (2024a)). To explore the scaling properties of NOVA, we expanded the dataset to approximately 600M image-text pairs by selecting more images that have a minimum aesthetic score of 5.0 from LAION (Schuhmann et al. (2022)), DataComp, and COYO. For text-to-video training, we select 19M video-text pairs on a subset (Lin et al. (2024)) of Panda-70M (Chen et al. (2024b)) and internal video-text pairs. We further collect 1M high-resolution video-text pairs from Pexels (Pexels Team (2014)) to fine-tune our final video generation model."
Dataset Splits | No | The paper lists the large datasets used for training (e.g., 16M image-text pairs from DataComp (Gadre et al. (2024)), COYO (Byeon et al. (2022)), Unsplash (Unsplash Team (2020)), and JourneyDB (Sun et al. (2024a)); approximately 600M image-text pairs including LAION (Schuhmann et al. (2022)); 19M video-text pairs on a subset (Lin et al. (2024)) of Panda-70M (Chen et al. (2024b))). However, it does not specify training/validation/test splits for its own model training and evaluation, instead relying on external benchmarks to evaluate generated samples.
Hardware Specification | Yes | "Training details. NOVA is trained with sixteen A100 (40G) nodes."
Software Dependencies | No | The paper mentions using OpenCV (cv2) for optical flow and a pre-trained language model, but does not provide specific version numbers for these or for other key software libraries such as PyTorch or Python, which are essential for reproducibility.
Experiment Setup | Yes | "Training details. NOVA is trained with sixteen A100 (40G) nodes. We utilize the AdamW optimizer (Loshchilov et al. (2017)) (β1 = 0.9, β2 = 0.95) with a weight decay of 0.02 and a base learning rate of 1e-4 in all experiments. The peak learning rate is adjusted for different batch sizes during training using the scaling rule (Goyal (2017)): lr = base_lr × batchsize / 256. We train text-to-image models from scratch and then load these weights to train text-to-video models."
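The linear scaling rule quoted above can be expressed as a short function. This is an illustrative sketch, not code from the NOVA repository; the function name and the example batch size are assumptions.

```python
def peak_lr(base_lr: float, batch_size: int, reference_batch: int = 256) -> float:
    """Linear scaling rule (Goyal et al., 2017): scale the base learning
    rate proportionally to the global batch size, relative to a reference
    batch size of 256."""
    return base_lr * batch_size / reference_batch

# Hypothetical example: base lr 1e-4 (as in the paper) with a global batch of 1024
print(peak_lr(1e-4, 1024))  # → 0.0004
```

With the paper's base learning rate of 1e-4, any batch size other than 256 simply rescales the peak rate by batchsize/256.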