GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration

Authors: Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, Priyadarshini Panda

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct our experiments on DeiT-S/B models (Touvron et al., 2021). We select 128 samples from the ImageNet training dataset as calibration data. The compared baselines are PTQ4ViT (Liu et al., 2021), APQ-ViT (Ding et al., 2022), PD-Quant (Liu et al., 2023b), RepQ-ViT (Li et al., 2023), and GPTQ (Frantar et al., 2022). Most of them are finetuning-free approaches. On vision transformers, we use act-order, an option in GPTQ that sorts the columns based on Hessian diagonal magnitude, which we found useful for improving performance. The dampening ratio was set to 10% for improved generalization. We test with W2A4 and W4A4 quantization. We provide the results in the left part of Table 1, from which we observe that GPTQ and our GPTAQ outperform the existing quantization regime due to explicit optimization of weights accounting for quantization error minimization.
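The act-order and dampening options quoted above can be illustrated with a minimal NumPy sketch. This is our own illustration, not code from the GPTAQ release: the Hessian proxy H = X Xᵀ from calibration activations, the column permutation by descending diagonal magnitude, and the 10% diagonal dampening follow the GPTQ convention, but the function names are ours.

```python
import numpy as np

def act_order_permutation(H):
    """Column permutation sorting by descending Hessian diagonal
    magnitude (GPTQ's act-order option)."""
    return np.argsort(-np.abs(np.diag(H)))

def dampen(H, ratio=0.10):
    """Add ratio * mean(diag(H)) to the diagonal for numerical
    stability before the Cholesky/inverse step."""
    return H + ratio * np.mean(np.diag(H)) * np.eye(H.shape[0])

# Hessian proxy from calibration activations X (d x n): H = X @ X.T
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 128))
H = X @ X.T

perm = act_order_permutation(H)            # quantize "important" columns first
H_damped = dampen(H, ratio=0.10)           # 10% dampening ratio, as in the paper
```

Columns with a larger Hessian diagonal contribute more to the layer output error, so quantizing them first lets later columns absorb their rounding error.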
Researcher Affiliation Academia Department of Electrical Engineering, Yale University. Correspondence to: Yuhang Li <EMAIL>.
Pseudocode Yes Algorithm 1 GPTAQ quantization for one layer Algorithm 2 GPTAQ quantization for entire transformer model
Open Source Code Yes Code is available at GitHub.
Open Datasets Yes We select 128 samples from the ImageNet training dataset as calibration data. We select 128 2048-token training sequences from the Wikitext2 training set as calibration dataset. We use 128 examples from the C4 dataset (Raffel et al., 2020) to calibrate the model.
Dataset Splits Yes We select 128 input samples as calibration dataset; see detailed source in each model type section. We select 128 2048-token training sequences from the Wikitext2 training set as calibration dataset. We use 128 examples from the C4 dataset (Raffel et al., 2020) to calibrate the model.
Hardware Specification Yes On a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer that achieves 90% ImageNet pretraining accuracy. We additionally report the GPU hours (on one A100) required to run the algorithm. We assume that XX and L are obtained previously, and test the latency on one A100 GPU with PyTorch 2.4.1-cu12.4.
Software Dependencies Yes We implement GPTAQ using Hugging Face (Wolf, 2019) on top of the PyTorch framework (Paszke et al., 2019). We assume that XX and L are obtained previously, and test the latency on one A100 GPU with PyTorch 2.4.1-cu12.4.
Experiment Setup Yes Unless specifically mentioned, we always use per-channel asymmetric quantization for weights and per-token asymmetric quantization for input activations. The input activation has a clipping ratio of 0.9 as suggested in Ashkboos et al. (2024), and the weight clipping range is searched by minimizing mean squared error (Frantar et al., 2022). We select 128 input samples as calibration dataset; see detailed source in each model type section. For the GPTQ implementation, we first quantize weights and then quantize activations following prior work (Ashkboos et al., 2024; Liu et al., 2024), while our GPTAQ quantizes activations first and minimizes layer output residual error in weight quantization. On vision transformers, we use act-order, an option in GPTQ that sorts the columns based on Hessian diagonal magnitude, which we found useful for improving performance. The dampening ratio was set to 10% for improved generalization. We test with W2A4 and W4A4 quantization. We perform quantization in W4A4 and W2A4 scenarios as we did on the vision transformer. We use a symmetric format (no zero point), and the group size is set to 128.
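The per-channel/per-token asymmetric scheme quoted above can be sketched as a single fake-quantization helper. This is a minimal NumPy illustration under our own naming, not the paper's implementation: min/max range estimation per row, a shared bit width, and round-to-nearest with a zero point (the asymmetric format; the symmetric format mentioned for the last setting would drop the zero point).

```python
import numpy as np

def asym_quantize(x, n_bits, axis):
    """Asymmetric uniform fake-quantization along `axis`:
    per-output-channel for weights, per-token for activations.
    Returns the dequantized tensor."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2 ** n_bits - 1)
    scale = np.where(scale == 0, 1.0, scale)      # guard constant rows
    zero = np.round(-lo / scale)                  # asymmetric zero point
    q = np.clip(np.round(x / scale) + zero, 0, 2 ** n_bits - 1)
    return (q - zero) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # weight matrix: one scale per output channel (row)
A = rng.normal(size=(4, 16))   # activations: one scale per token (row)

W_q = asym_quantize(W, n_bits=4, axis=1)   # the W4 part of W4A4
A_q = asym_quantize(A, n_bits=4, axis=1)   # the A4 part of W4A4
```

With min/max ranges, the per-element rounding error is bounded by half a quantization step per row; the clipping-ratio and MSE-search tricks quoted above shrink the range (and hence the step size) at the cost of clipping outliers.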