OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Authors: Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT. Project webpage is available at https://nju-pcalab.github.io/projects/openvid.
Researcher Affiliation | Collaboration | Kepan Nan1, Rui Xie1, Penghao Zhou2, Tiehan Fan1, Zhenheng Yang2, Zhijie Chen2, Xiang Li3, Jian Yang1, Ying Tai1. 1 State Key Laboratory for Novel Software Technology, Nanjing University; 2 ByteDance; 3 Nankai University.
Pseudocode | No | The paper describes the architecture of MVDiT and its modules (Multi-Modal Self-Attention, Multi-Modal Temporal-Attention, Multi-Head Cross-Attention) in detail, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "Project webpage is available at https://nju-pcalab.github.io/projects/openvid." This is a general project webpage; the paper does not explicitly state that source code for the described methodology is available there, nor does it provide a direct link to a code repository.
Open Datasets | Yes | The paper explicitly introduces the "OpenVid-1M" and "OpenVidHD-0.4M" datasets and states, "OpenVid-1M will be made publicly available". Additionally, it refers to and cites established public datasets used for comparison: "WebVid-10M (Bain et al., 2021)" and "Panda-70M (Chen et al., 2024b)".
Dataset Splits | No | The paper mentions using 1,117 validation samples for human preference evaluation and sampling video clips for training, but it does not specify explicit training/validation/test splits (e.g., percentages, counts, or predefined split references) for the main experiments. It states "We randomly sampled a subset from the collected raw data and processed it through our data processing pipeline," but this refers to data shown to human evaluators, not to model training splits.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 80G GPUs.
Software Dependencies | No | The paper mentions using Adam as an optimizer, LLaVA-v1.6-34b for captioning, CLIP, UniMatch, DOVER, the Cascaded Cut Detector, the LAION Aesthetics Predictor, PixArt-α for weight initialization, and the T5 model as the text encoder. However, it does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2014) as the optimizer, and the learning rate is set to 2e-5. We sample video clips containing 16 frames at 3-frame intervals in each iteration. We adopt random horizontal flips and random crops to augment the clips during the training stage. All experiments are conducted on NVIDIA A100 80G GPUs. We adopt PixArt-α (Chen et al., 2023b) for weight initialization and employ the T5 model as the text encoder. The training process starts with 256×256 models, whose weights are then used to train 512×512 models, and these in turn serve as pretrained weights for 1024×1024 models.
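The clip-sampling scheme reported above (16 frames at 3-frame intervals, with a random starting offset) can be sketched as follows. This is a minimal illustration of the sampling arithmetic only, not the authors' code; the function name and signature are hypothetical.

```python
import random

def sample_clip_indices(num_video_frames: int,
                        num_frames: int = 16,
                        interval: int = 3) -> list[int]:
    """Return frame indices for one training clip: `num_frames` frames
    spaced `interval` frames apart, starting at a random offset.
    Matches the paper's reported setting of 16 frames at 3-frame intervals."""
    # Total number of source frames the clip spans.
    span = (num_frames - 1) * interval + 1
    if num_video_frames < span:
        raise ValueError("video too short for requested clip")
    start = random.randrange(num_video_frames - span + 1)
    return list(range(start, start + span, interval))

# Example: sampling a clip from a 100-frame video.
indices = sample_clip_indices(100)
assert len(indices) == 16
```

Random horizontal flips and random crops would then be applied to the frames at these indices before each training iteration, per the setup described in the row above.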