GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration
Authors: Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, Priyadarshini Panda
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct our experiments on DeiT-S/B models (Touvron et al., 2021). We select 128 samples from the ImageNet training dataset as calibration data. The compared baselines are PTQ4ViT (Liu et al., 2021), APQ-ViT (Ding et al., 2022), PD-Quant (Liu et al., 2023b), RepQ-ViT (Li et al., 2023), and GPTQ (Frantar et al., 2022). Most of them are finetuning-free approaches. On vision transformers, we use act-order, an option in GPTQ that sorts the columns by Hessian diagonal magnitude, which we found useful for improving performance. The dampening ratio was set to 10% for improved generalization. We test with W2A4 and W4A4 quantization. We provide the results in the left part of Table 1, from which we observe that GPTQ and our GPTAQ outperform existing quantization regimes due to explicit optimization of weights that accounts for quantization error minimization. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, Yale University. Correspondence to: Yuhang Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 GPTAQ quantization for one layer Algorithm 2 GPTAQ quantization for entire transformer model |
| Open Source Code | Yes | Code is available at Github. |
| Open Datasets | Yes | We select 128 samples from the ImageNet training dataset as calibration data. We select 128 2048-token training sequences from the Wikitext2 training set as calibration dataset. We use 128 examples from the C4 dataset (Raffel et al., 2020) to calibrate the model. |
| Dataset Splits | Yes | We select 128 input samples as the calibration dataset; see the detailed source in each model-type section. We select 128 2048-token training sequences from the Wikitext2 training set as calibration dataset. We use 128 examples from the C4 dataset (Raffel et al., 2020) to calibrate the model. |
| Hardware Specification | Yes | On a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer that achieves 90% pretraining ImageNet accuracy. We additionally report the GPU hours (on one A100) required to run the algorithm. We assume that XX and L are obtained previously, and test the latency on one A100 GPU with PyTorch 2.4.1-cu12.4. |
| Software Dependencies | Yes | We implement GPTAQ using Hugging Face (Wolf, 2019) on top of the PyTorch framework (Paszke et al., 2019). We assume that XX and L are obtained previously, and test the latency on one A100 GPU with PyTorch 2.4.1-cu12.4. |
| Experiment Setup | Yes | Unless specifically mentioned, we always use per-channel asymmetric quantization for weights and per-token asymmetric quantization for input activations. The input activation has a clipping ratio of 0.9 as suggested in Ashkboos et al. (2024), and the weight clipping range is searched by minimizing mean squared error (Frantar et al., 2022). We select 128 input samples as the calibration dataset; see the detailed source in each model-type section. For the GPTQ implementation, we first quantize weights and then quantize activations following prior work (Ashkboos et al., 2024; Liu et al., 2024), while our GPTAQ quantizes activations first and minimizes layer output residual error in weight quantization. On vision transformers, we use act-order, an option in GPTQ that sorts the columns by Hessian diagonal magnitude, which we found useful for improving performance. The dampening ratio was set to 10% for improved generalization. We test with W2A4 and W4A4 quantization. We perform quantization in W4A4 and W2A4 scenarios as we did on the vision transformer. We use a symmetric format (no zero point), and the group size is set to 128. |
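The experiment-setup cell quotes per-channel asymmetric quantization for weights with a searched clipping range. A minimal NumPy sketch of per-channel asymmetric uniform quantization may help make the quoted setup concrete; the function name and the fixed illustrative `clip` ratio are assumptions for illustration, not the paper's code (the paper searches the clipping range by minimizing mean squared error):

```python
import numpy as np

def quantize_per_channel_asym(w, bits=4, clip=1.0):
    """Per-channel asymmetric uniform quantization (rows = output channels).

    `clip` is a fixed illustrative clipping ratio; the paper instead searches
    the weight clipping range per channel by minimizing MSE against `w`.
    """
    qmax = 2 ** bits - 1
    # Per-row (per-output-channel) min/max define the asymmetric range.
    wmin = w.min(axis=1, keepdims=True) * clip
    wmax = w.max(axis=1, keepdims=True) * clip
    scale = np.maximum(wmax - wmin, 1e-8) / qmax
    zero = np.round(-wmin / scale)  # asymmetric: nonzero zero-point
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale  # dequantized (fake-quantized) weights
```

With `bits=4` each channel is mapped onto at most 16 distinct levels, and with `clip=1.0` the round-off error per weight is bounded by half the per-channel scale.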
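The table also quotes the act-order option (sorting columns by Hessian diagonal magnitude) and a 10% dampening ratio. A sketch of that Hessian preprocessing, assuming the standard GPTQ-style layer-wise Hessian 2·XᵀX (the function name `gptq_hessian_setup` is hypothetical and this is not the authors' implementation):

```python
import numpy as np

def gptq_hessian_setup(X, damp_ratio=0.10):
    """Illustrative GPTQ-style Hessian preprocessing.

    X: calibration inputs, shape (n_samples, n_features).
    Dampening adds a fraction of the mean Hessian diagonal for numerical
    stability; act-order permutes columns by decreasing diagonal magnitude.
    """
    H = 2.0 * X.T @ X
    damp = damp_ratio * np.diagonal(H).mean()
    H = H + damp * np.eye(H.shape[0])  # dampened Hessian
    perm = np.argsort(-np.diagonal(H))  # act-order: largest diagonal first
    return H[np.ix_(perm, perm)], perm
```

Columns with larger Hessian diagonal (i.e., larger expected quantization impact) are then processed first by the solver, which the paper reports as helpful on vision transformers.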