ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
Authors: Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental Setup. Baseline. For weight-only quantization, we compare our approach with GPTQ (Frantar et al. 2022), AWQ (Lin et al. 2024a), OmniQuant (Shao et al. 2023), and AffineQuant (Ma et al. 2024b). For weight-activation quantization, we benchmark our method against SmoothQuant (Xiao et al. 2023), OmniQuant (Shao et al. 2023), and I-LLM (Hu et al. 2024b). Models and Datasets. We primarily evaluate our method using LLaMA (7B-13B) (Touvron et al. 2023a) and LLaMA-2 (7B-13B) (Touvron et al. 2023b) in this paper. Following previous work (Shao et al. 2023; Ma et al. 2024b), we evaluate the quantized models by reporting the perplexity of language generation experiments on WikiText2 (Merity et al. 2016) and C4 (Raffel et al. 2020). |
| Researcher Affiliation | Industry | Chao Zeng*, Songwei Liu*, Yusheng Xie*, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei. ByteDance Inc, Shenzhen, China. EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology using mathematical equations and descriptive text, but it does not include any clearly labeled pseudocode blocks or algorithm figures. |
| Open Source Code | No | The paper does not explicitly state that source code is provided, nor does it include any links to code repositories or mention code in supplementary materials. |
| Open Datasets | Yes | Models and Datasets. We primarily evaluate our method using LLaMA (7B-13B) (Touvron et al. 2023a) and LLaMA-2 (7B-13B) (Touvron et al. 2023b) in this paper. Following previous work (Shao et al. 2023; Ma et al. 2024b), we evaluate the quantized models by reporting the perplexity of language generation experiments on WikiText2 (Merity et al. 2016) and C4 (Raffel et al. 2020). To assess performance on zero-shot tasks, we select several popular benchmarks including PIQA (Bisk et al. 2020), ARC (Clark et al. 2018), BoolQ (Clark et al. 2019), HellaSwag (Zellers et al. 2019), and Winogrande (Sakaguchi et al. 2021) using the lm-evaluation-harness (Gao et al. 2021). |
| Dataset Splits | Yes | Calibration. ...Calibration data includes 128 randomly selected 2048-token segments from WikiText2. ... We primarily evaluate our method using LLaMA (7B-13B) (Touvron et al. 2023a) and LLaMA-2 (7B-13B) (Touvron et al. 2023b) in this paper. Following previous work (Shao et al. 2023; Ma et al. 2024b), we evaluate the quantized models by reporting the perplexity of language generation experiments on WikiText2 (Merity et al. 2016) and C4 (Raffel et al. 2020). To assess performance on zero-shot tasks, we select several popular benchmarks including PIQA (Bisk et al. 2020), ARC (Clark et al. 2018), BoolQ (Clark et al. 2019), HellaSwag (Zellers et al. 2019), and Winogrande (Sakaguchi et al. 2021) using the lm-evaluation-harness (Gao et al. 2021). |
| Hardware Specification | Yes | The calibration process, conducted on an NVIDIA A800-40G GPU, utilized a batch size of 1 and spanned 20 epochs. ... Our experiments were conducted on two different GPUs: the RTX 4080 and the RTX 3070. |
| Software Dependencies | No | The paper mentions software components like the "AdamW optimizer (Loshchilov and Hutter 2017)", "CUTLASS", and "cuBLAS" but does not provide specific version numbers for any of them or for other software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | Calibration. We initialize the balance vectors for weights and activations following (Xiao et al. 2023), with the learnable clipping parameter for weights set to 1. For the distribution correction compensation vectors, we set a as an all-ones vector and b as an all-zeros vector, ensuring a·b starts at 0. Using the AdamW optimizer (Loshchilov and Hutter 2017) with no weight decay, we set learning rates of 5e-3 for the balance vectors and 1e-2 for the clipping parameter and compensation vectors. Calibration data includes 128 randomly selected 2048-token segments from WikiText2. The calibration process, conducted on an NVIDIA A800-40G GPU, used a batch size of 1 and spanned 20 epochs. For activations and the KV cache we perform per-token quantization, and for weights we perform per-channel quantization. |
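The quantization granularities named in the setup row (per-token for activations and the KV cache, per-channel for weights) can be sketched as symmetric round-to-nearest fake quantization. This is an illustrative sketch only: the function names and the symmetric scheme are assumptions, not the paper's exact ABQ-LLM implementation, which additionally learns balance and compensation vectors.

```python
import numpy as np

def fake_quantize(x, n_bits, axis):
    """Symmetric round-to-nearest fake quantization.

    One scale is computed per slice along `axis`:
    axis=-1 with token-major activations gives per-token scales;
    axis=1 with (out_channels, in_channels) weights gives per-channel scales.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)          # guard against all-zero slices
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized ("fake-quant") tensor

# Hypothetical shapes: 4 tokens with hidden size 16, and an 8x16 weight matrix.
rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 16)).astype(np.float32)
weights = rng.standard_normal((8, 16)).astype(np.float32)

acts_q = fake_quantize(acts, n_bits=8, axis=-1)     # per-token activations
weights_q = fake_quantize(weights, n_bits=4, axis=1)  # per-channel weights
```

Because each token (or channel) gets its own scale, the round-to-nearest error of any element is bounded by half that slice's scale, which is why finer granularity generally lowers quantization error at low bit widths.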