VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

Authors: Juncan Deng, Shuaiting Li, Zeyu Wang, Hong Gu, Kedong Xu, Kejie Huang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that VQ4DiT establishes a new state-of-the-art trade-off between model size and performance, quantizing weights to 2-bit precision while retaining acceptable image generation quality. The paper includes sections such as 'Experiments', 'Experimental Settings', 'Main Results', and 'Ablation Study' and presents performance metrics in tables.
Researcher Affiliation | Collaboration | Juncan Deng¹*, Shuaiting Li¹*, Zeyu Wang¹, Hong Gu², Kedong Xu², Kejie Huang¹ (¹Zhejiang University; ²vivo Mobile Communication Co., Ltd.) EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; the methodology is described in prose and visualized in a pipeline diagram (Figure 1).
Open Source Code | No | The paper does not provide an explicit statement about the availability of source code, nor does it include any links to a code repository.
Open Datasets | Yes | 'Training DiTs typically relies on the ImageNet dataset (Russakovsky et al. 2015).' 'Our method achieves competitive evaluation results compared to full-precision models on the ImageNet (Russakovsky et al. 2015) benchmark.'
Dataset Splits | No | The paper states: 'We select the pre-trained DiT XL/2 model as the floating-point reference model, which has two versions for generating images with resolutions of 256×256 and 512×512, respectively,' and 'The validation setup is generally consistent with the settings used in the original DiT paper (Peebles and Xie 2023).' Although it references the ImageNet dataset and mentions sampling 10k images for evaluation (these are generated images, not dataset splits), it does not explicitly specify the training/validation/test splits of ImageNet used to train or evaluate the DiT models. For its own calibration step, it describes a 'zero-data and block-wise calibration method'.
Hardware Specification | Yes | 'VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours depending on the different quantization settings.' '... allowing the experiments to be conducted on a single NVIDIA A100 GPU within 20 minutes to 5 hours.'
Software Dependencies | No | The paper mentions using the 'RMSprop optimizer' but does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) with version numbers that would be necessary to replicate the experiment.
Experiment Setup | Yes | 'We calibrate all quantized models using the RMSprop optimizer, with a constant learning rate of 5×10⁻² for ratios of candidate assignments and 1×10⁻⁴ for other parameters. The batch size and iteration are set to 16 and 500 respectively. We employ a DDPM scheduler with sampling timesteps of 50, 100, and 250. The classifier-free guidance (CFG) is set to 1.5.'