CBQ: Cross-Block Quantization for Large Language Models
Authors: Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ takes only 4.3 hours to perform weight-only quantization of a 4-bit LLAMA1-65B model, achieving a commendable trade-off between performance and efficiency. |
| Researcher Affiliation | Collaboration | 1 University of Science and Technology of China, 2 Huawei Noah's Ark Lab, 3 Hong Kong University of Science and Technology (GZ) |
| Pseudocode | Yes | Algorithm 1: Coarse-to-Fine Preprocessing. Input: the input tensor X, the balancing coefficients λ1, λ2. Output: outlier O |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We validate our quantization scheme on various datasets which are divided into two categories. One is reported by the perplexity metric of language generation experiments on C4 (Raffel et al. (2020)) and WikiText2 (Merity et al. (2016)). The other is reported by the accuracy metric of zero-shot language tasks (Gao et al. (2021)) on PIQA (Bisk et al. (2020a)), HellaSwag (Clark et al. (2018)), ARC (Clark et al. (2018)), Mutual (Cui et al. (2020)) and Ethics (Hendrycks et al. (2020a)). |
| Dataset Splits | Yes | Following the setting of previous work (Frantar et al. (2022b); Liu et al. (2023b); Yao et al. (2024); Yuan et al. (2023)), our calibration dataset comprises 128 randomly selected 2048-token segments from C4 to ensure standardized benchmarking. |
| Hardware Specification | No | We quantize all models using a mini-batch size of 1 on a single GPU. (This statement is too general, it does not specify the GPU model or any other specific hardware details.) |
| Software Dependencies | No | The paper acknowledges the use of MindSpore, CANN (Compute Architecture for Neural Networks) and the Ascend AI Processor, but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | To balance quantization performance and training speed, we utilize sliding windows containing two blocks with 3 epochs per window. For the LoRA-Rounding technique, we set the rank r to 5. The optimization process involves adjusting the learnable quantization step sizes (SX and SW) and the weight-rounding matrix (δW) with learning rates of 1e-4, 1e-3, and 1e-4, respectively. To manage the learning rate, we utilize the CosineAnnealingLR scheduler. We quantize all models using a mini-batch size of 1 on a single GPU. |
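The Experiment Setup row can be illustrated with a minimal toy sketch of the optimization it describes: a learnable quantization step size and a low-rank (rank-5) weight-rounding offset trained against a block's full-precision output, with Adam and a CosineAnnealingLR schedule. Everything below is an illustrative assumption, not the paper's implementation: the tensor shapes, the Adam optimizer, the straight-through rounding estimator, and the block-reconstruction loss are stand-ins chosen to make the loop runnable.

```python
import torch


def ste_round(v: torch.Tensor) -> torch.Tensor:
    """Round with a straight-through estimator so gradients flow to v."""
    return (v.round() - v).detach() + v


def fake_quant(w, step, delta, n_bits=4):
    """Fake-quantize weights with a learnable step size and rounding offset."""
    qmax = 2 ** (n_bits - 1) - 1                  # 7 for signed 4-bit
    q = torch.clamp(ste_round(w / step + delta), -qmax - 1, qmax)
    return q * step


torch.manual_seed(0)
w = torch.randn(64, 64)                           # toy full-precision weight
x = torch.randn(16, 64)                           # toy calibration batch
ref = x @ w.t()                                   # full-precision block output

rank = 5                                          # rank r from the setup row
A = 0.01 * torch.randn(64, rank)                  # low-rank factors: delta = A @ B
B = torch.zeros(rank, 64)                         # zero init keeps delta = 0 at start
A.requires_grad_(); B.requires_grad_()
step = torch.full((64, 1), w.abs().max().item() / 7).requires_grad_()

opt = torch.optim.Adam([
    {"params": [step], "lr": 1e-4},               # step-size learning rate
    {"params": [A, B], "lr": 1e-4},               # rounding-offset learning rate
])
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for _ in range(100):
    w_q = fake_quant(w, step, A @ B)
    loss = torch.mean((x @ w_q.t() - ref) ** 2)   # block reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```

The straight-through estimator is what makes the rounding step differentiable, so both the step size and the low-rank offset can be trained end to end; the paper's actual CBQ objective and sliding-window schedule are richer than this single-block loop.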