Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9. Code will be available at https://github.com/cantbebetter2/Q-VDiT. ... Extensive experiments on generative benchmarks show that Q-VDiT significantly outperforms current SOTA post-training quantization methods.
Researcher Affiliation | Academia | 1) Institute of Computing Technology, Chinese Academy of Sciences; 2) University of Chinese Academy of Sciences; 3) ETH Zurich.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Code will be available at https://github.com/cantbebetter2/Q-VDiT.
Open Datasets | Yes | Following previous work ViDiT-Q (Zhao et al., 2024), we apply our Q-VDiT to Open-Sora (HPC-AI, 2024) and Latte (Ma et al., 2024) for the video generation task. ... We first evaluate the quantized model on VBench (Huang et al., 2024b) ... For Latte, we adopt the class-conditioned Latte model trained on UCF-101 and use the 20-step DDIM solver with a CFG scale of 7.0. More details can be found in Appendix Sec. D. ... We employ one randomly selected video per label from the UCF-101 dataset (101 videos in total) (Soomro, 2012) as the reference ground-truth videos for FVD evaluation.
Dataset Splits | No | The paper mentions using 10 prompts from Open-Sora and 101 prompts from UCF-101 for evaluation and calibration, and specific prompt sets for VBench evaluation (93, 72, and 86 prompts), but does not provide training/validation/test splits for the underlying models used in the experiments.
Hardware Specification | No | The paper mentions 'GPU memory' and 'GPU Time' in Table 5, but does not specify the GPU or CPU models, processor types, or memory amounts used to run the experiments.
Software Dependencies | No | The paper does not list software dependencies with version numbers for the libraries or frameworks used in the experiments.
Experiment Setup | Yes | We mainly focus on the harder settings of W4A6 (4-bit weight quantization and 6-bit activation quantization), W3A8, and W3A6. ... For post-training quantization, we calibrate 5k iters for 6-8 bit, 10k iters for 4-bit, and 15k iters for 3-bit. For calibration parameters, we use a batch size of 4, a learning rate of 1e-6 for weight quantization parameters, and 1e-5 for TQE parameters. ... For the Open-Sora (HPC-AI, 2024) model, we use 100-step DDIM with a CFG scale of 4.0. For Latte, we adopt the class-conditioned Latte model trained on UCF-101 and use the 20-step DDIM solver with a CFG scale of 7.0.
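The W4A6/W3A8/W3A6 notation in the Experiment Setup row means separate uniform bit widths for weights and activations (e.g. 4-bit weights with 6-bit activations). As a rough illustration only — this is a generic symmetric min-max "fake quantizer" in NumPy, not Q-VDiT itself, which adds calibrated quantization parameters, TQE, and distillation on top — the effect of a bit-width choice can be sketched as:

```python
import numpy as np

def uniform_quantize(x, n_bits):
    """Per-tensor symmetric uniform fake-quantization to n_bits.

    Rounds x onto a signed integer grid [-2^(n-1), 2^(n-1)-1] and maps
    back to floats, so the return value carries the quantization error.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax          # min-max scale from the tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                        # dequantized values

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)  # "weights"
a = rng.standard_normal((16, 64)).astype(np.float32)  # "activations"

# W4A6: 4-bit weights, 6-bit activations
w_q = uniform_quantize(w, 4)
a_q = uniform_quantize(a, 6)

# The 6-bit grid is 4x finer, so activation error is much smaller.
print(np.abs(w - w_q).mean(), np.abs(a - a_q).mean())
```

Post-training quantization methods such as the one evaluated here then tune the quantization parameters (here just `scale`) on a small calibration set rather than taking the raw min-max value.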