OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Authors: Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT. Project webpage is available at https://nju-pcalab.github.io/projects/openvid.
Researcher Affiliation | Collaboration | Kepan Nan1, Rui Xie1, Penghao Zhou2, Tiehan Fan1, Zhenheng Yang2, Zhijie Chen2, Xiang Li3, Jian Yang1, Ying Tai1. 1 State Key Laboratory for Novel Software Technology, Nanjing University; 2 ByteDance; 3 Nankai University.
Pseudocode | No | The paper describes the architecture of MVDiT and its modules (Multi-Modal Self-Attention, Multi-Modal Temporal-Attention, Multi-Head Cross-Attention) in detail, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "Project webpage is available at https://nju-pcalab.github.io/projects/openvid." This is a general project webpage; the paper does not explicitly state that source code for the described methodology is available there, nor does it provide a direct link to a code repository.
Open Datasets | Yes | The paper explicitly introduces the "OpenVid-1M" and "OpenVidHD-0.4M" datasets and states, "OpenVid-1M will be made publicly available". Additionally, it refers to and cites established public datasets used for comparison: "WebVid-10M (Bain et al., 2021)" and "Panda-70M (Chen et al., 2024b)".
Dataset Splits | No | The paper mentions using 1,117 validation samples for human preference evaluation and sampling video clips for training, but it does not specify explicit training/validation/test splits (e.g., percentages, counts, or predefined split references) for the main experiments. It states "We randomly sampled a subset from the collected raw data and processed it through our data processing pipeline," but this refers to data shown to human evaluators, not to model training splits.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 80G GPUs.
Software Dependencies | No | The paper mentions using Adam as an optimizer, LLaVA-v1.6-34b for captioning, CLIP, UniMatch, DOVER, the Cascaded Cut Detector, the LAION Aesthetics Predictor, PixArt-α for weight initialization, and the T5 model as the text encoder. However, it does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2014) as the optimizer, and the learning rate is set to 2e-5. We sample video clips containing 16 frames at 3-frame intervals in each iteration. We adopt random horizontal flips and random crops to augment the clips during the training stage. All experiments are conducted on NVIDIA A100 80G GPUs. We adopt PixArt-α (Chen et al., 2023b) for weight initialization and employ the T5 model as the text encoder. The training process starts with 256×256 models, whose weights are then used to train 512×512 models, and these in turn serve as pretrained weights for 1024×1024 models.
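The clip-sampling scheme reported above (16 frames at 3-frame intervals, with a random starting offset) can be sketched as follows. This is a minimal illustration of the sampling arithmetic only, not the authors' code; the function name and signature are hypothetical.

```python
import random

def sample_clip_indices(num_video_frames: int,
                        num_frames: int = 16,
                        interval: int = 3) -> list[int]:
    """Return frame indices for one training clip: `num_frames` frames
    spaced `interval` frames apart, starting at a random offset.
    Matches the paper's reported setting of 16 frames at 3-frame intervals."""
    # Total number of source frames the clip spans.
    span = (num_frames - 1) * interval + 1
    if num_video_frames < span:
        raise ValueError("video too short for requested clip")
    start = random.randrange(num_video_frames - span + 1)
    return list(range(start, start + span, interval))

# Example: sampling a clip from a 100-frame video.
indices = sample_clip_indices(100)
assert len(indices) == 16
```

Random horizontal flips and random crops would then be applied to the frames at these indices before each training iteration, per the setup described in the row above.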