VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
Authors: Juncan Deng, Shuaiting Li, Zeyu Wang, Hong Gu, Kedong Xu, Kejie Huang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality. The paper includes sections like 'Experiments', 'Experimental Settings', 'Main Results', and 'Ablation Study' and presents performance metrics in tables. |
| Researcher Affiliation | Collaboration | Juncan Deng¹*, Shuaiting Li¹*, Zeyu Wang¹, Hong Gu², Kedong Xu², Kejie Huang¹ (¹Zhejiang University; ²vivo Mobile Communication Co., Ltd.) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methodology is described in prose and visualized in a pipeline diagram (Figure 1). |
| Open Source Code | No | The paper does not provide an explicit statement about the availability of source code, nor does it include any links to a code repository. |
| Open Datasets | Yes | Training DiTs typically relies on the ImageNet dataset (Russakovsky et al. 2015). Our method achieves competitive evaluation results compared to full-precision models on the ImageNet (Russakovsky et al. 2015) benchmark. |
| Dataset Splits | No | The paper states: 'We select the pre-trained DiT XL/2 model as the floating-point reference model, which has two versions for generating images with resolutions of 256×256 and 512×512, respectively.' and 'The validation setup is generally consistent with the settings used in the original DiT paper (Peebles and Xie 2023).' While it references the ImageNet dataset and mentions sampling 10k images for evaluation (these are generated images, not dataset splits), it does not explicitly provide the training/validation/test splits of ImageNet used to train or evaluate the DiT models. It instead describes a 'zero-data and block-wise calibration method' for its own calibration step. |
| Hardware Specification | Yes | VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours, depending on the quantization settings. |
| Software Dependencies | No | The paper mentions using 'RMSprop optimizer' but does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) with their version numbers that would be necessary to replicate the experiment. |
| Experiment Setup | Yes | We calibrate all quantized models using the RMSprop optimizer, with a constant learning rate of 5×10⁻² for ratios of candidate assignments and 1×10⁻⁴ for other parameters. The batch size and iteration count are set to 16 and 500, respectively. We employ a DDPM scheduler with sampling timesteps of 50, 100, and 250. The classifier-free guidance (CFG) scale is set to 1.5. |
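The calibration setup quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' released code (none is available): it assumes PyTorch, and the tensors `assignment_ratios` and `other_params` are hypothetical stand-ins for the two parameter groups the paper distinguishes.

```python
import torch

# Hypothetical stand-ins for the two groups of calibrated parameters:
# ratios of candidate codeword assignments, and all remaining parameters
# (e.g. codebook entries). Shapes are illustrative only.
assignment_ratios = torch.nn.Parameter(torch.rand(64, 4))
other_params = torch.nn.Parameter(torch.randn(256, 8))

# RMSprop with per-group constant learning rates, as reported in the paper:
# 5e-2 for assignment ratios, 1e-4 for everything else.
optimizer = torch.optim.RMSprop(
    [
        {"params": [assignment_ratios], "lr": 5e-2},
        {"params": [other_params], "lr": 1e-4},
    ]
)

batch_size, num_iterations = 16, 500  # calibration settings from the paper
```

A real calibration loop would then run `num_iterations` steps of a block-wise reconstruction loss over batches of size `batch_size`, which the paper describes only in prose.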