CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, Jie Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. The paper reports experimental results: "We evaluate CogVideoX through automated metric evaluation and human assessment, compared with openly-accessible top-performing text-to-video models. CogVideoX achieves state-of-the-art performance" in both automated benchmarks and human evaluation.
Researcher Affiliation: Collaboration. The author list names Zhipu AI and Tsinghua University, and email addresses (redacted as EMAIL, including the corresponding author's) indicate affiliations with both an industry entity (Zhipu AI) and an academic institution (Tsinghua University).
Pseudocode: No. The paper describes the methodology and architecture using textual descriptions and diagrams (e.g., Figure 3 illustrates the overall architecture, Figure 4 shows the structure of the 3D VAE), but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. "We publish the code and model checkpoints of CogVideoX along with our VAE model and video captioning model at https://github.com/THUDM/CogVideo."
Open Datasets: Yes. "We additionally use 2B images filtered with aesthetics score from LAION-5B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022) datasets to assist training. There are some video caption datasets available now, such as Panda70M (Chen et al., 2024b), COCO Caption (Lin et al., 2014), and WebVid (Bain et al., 2021b)."
Dataset Splits: Yes. Ablation studies are run on a WebVid test dataset of 500 videos: "We compared our 3D VAE with other open-source 3D VAEs on 256×256-resolution 17-frame videos, using the validation set of WebVid (Bain et al., 2021a)." In Table 14, the paper presents the accuracy and recall of its classifier, trained based on Video-LLaMA, on the test set (10% randomly labeled data).
Hardware Specification: Yes. The model is evaluated in BF16 precision on H800 GPUs with 50 inference steps, and inference latency is also measured for one DiT forward step on the same hardware.
Software Dependencies: No. The paper mentions various models and frameworks used (e.g., T5, Llama 2, GPT-4, CogVLM), but does not provide specific version numbers for software dependencies such as programming languages or libraries needed to reproduce the experiments.
Experiment Setup: Yes. Tables 5 and 6 ("Hyperparameters of CogVideoX-2B and CogVideoX-5B") provide detailed training stages with max resolutions, durations, batch sizes, sequence lengths, and training steps, as well as specific model hyperparameters such as number of layers, attention heads, hidden size, learning-rate decay, and training precision.