CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, Jie Tang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CogVideoX through automated metric evaluation and human assessment, compared with openly-accessible top-performing text-to-video models. Results show that CogVideoX achieves state-of-the-art performance in both automated benchmarks and human evaluation. |
| Researcher Affiliation | Collaboration | The affiliation line "Zhipu AI, Tsinghua University" and the (redacted) author and corresponding-author email addresses indicate affiliations with both an industry entity (Zhipu AI) and an academic institution (Tsinghua University). |
| Pseudocode | No | The paper describes the methodology and architecture using textual descriptions and diagrams (e.g., Figure 3 illustrates the overall architecture, Figure 4 shows the structure of the 3D VAE), but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We publish the code and model checkpoints of CogVideoX along with our VAE model and video captioning model at https://github.com/THUDM/CogVideo. |
| Open Datasets | Yes | We additionally use 2B images filtered with aesthetics score from the LAION-5B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022) datasets to assist training. There are some video caption datasets available now, such as Panda-70M (Chen et al., 2024b), COCO Caption (Lin et al., 2014), and WebVid (Bain et al., 2021b). |
| Dataset Splits | Yes | Ablation studies are conducted on a WebVid test set of 500 videos. We compared our 3D VAE with other open-source 3D VAEs on 256×256-resolution 17-frame videos, using the validation set of WebVid (Bain et al., 2021a). In Table 14, we present the accuracy and recall of our classifier, trained based on Video-LLaMA, on the test set (10% randomly labeled data). |
| Hardware Specification | Yes | We evaluate the model in BF16 precision on H800 GPUs with 50 inference steps, and separately with a single DiT forward step. |
| Software Dependencies | No | The paper mentions various models and frameworks used (e.g., T5, Llama 2, GPT-4, CogVLM), but does not provide specific version numbers for software dependencies, such as the programming languages or libraries needed to reproduce the experiments. |
| Experiment Setup | Yes | Tables 5 and 6 (hyperparameters of CogVideoX-2B and CogVideoX-5B) provide detailed training stages with max resolutions, durations, batch sizes, sequence lengths, and training steps, as well as specific model hyperparameters such as the number of layers, attention heads, hidden size, learning-rate decay, and training precision. |