CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, Jie Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. The paper reports experimental results: "We evaluate CogVideoX through automated metric evaluation and human assessment, compared with openly-accessible top-performing text-to-video models. CogVideoX achieves state-of-the-art performance" in both automated benchmarks and human evaluation.
Researcher Affiliation: Collaboration. The author list names Zhipu AI and Tsinghua University, and email addresses (redacted as EMAIL, including the corresponding author's) indicate affiliations with both an industry entity (Zhipu AI) and an academic institution (Tsinghua University).
Pseudocode: No. The paper describes the methodology and architecture using textual descriptions and diagrams (e.g., Figure 3 illustrates the overall architecture, Figure 4 shows the structure of the 3D VAE), but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. "We publish the code and model checkpoints of CogVideoX along with our VAE model and video captioning model at https://github.com/THUDM/CogVideo."
Open Datasets: Yes. "We additionally use 2B images filtered with aesthetics score from LAION-5B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022) datasets to assist training. There are some video caption datasets available now, such as Panda70M (Chen et al., 2024b), COCO Caption (Lin et al., 2014), and WebVid (Bain et al., 2021b)."
Dataset Splits: Yes. Ablation studies are run on a WebVid test dataset of 500 videos: "We compared our 3D VAE with other open-source 3D VAEs on 256×256-resolution 17-frame videos, using the validation set of WebVid (Bain et al., 2021a)." In Table 14, the paper presents the accuracy and recall of its classifier, trained based on Video-LLaMA, on the test set (10% randomly labeled data).
Hardware Specification: Yes. The model is evaluated in BF16 precision on H800 GPUs with 50 inference steps, and inference latency is also measured for one DiT forward step on the same hardware.
Software Dependencies: No. The paper mentions various models and frameworks used (e.g., T5, Llama 2, GPT-4, CogVLM), but does not provide specific version numbers for software dependencies such as programming languages or libraries needed to reproduce the experiments.
Experiment Setup: Yes. Tables 5 and 6 ("Hyperparameters of CogVideoX-2B and CogVideoX-5B") provide detailed training stages with max resolutions, durations, batch sizes, sequence lengths, and training steps, as well as specific model hyperparameters such as number of layers, attention heads, hidden size, learning-rate decay, and training precision.