FBQuant: FeedBack Quantization for Large Language Models
Authors: Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%. In this section, we present the experimental setup of models, baselines, datasets, metrics, and implementation details in Sec. 5.1. Then, we demonstrate the perplexity and zero-shot accuracy of various quantization methods in Sec. 5.2, followed by the performance of instruction-tuned models and the wall-clock latency on real devices. |
| Researcher Affiliation | Academia | ¹School of Electronic Science and Engineering, Nanjing University; ²Interdisciplinary Research Center for Future Intelligent Chips (Chip-X), Nanjing University, Suzhou; EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Layer-wise Reconstruction by FBQuant |
| Open Source Code | No | The paper does not explicitly provide a link to the source code or state that the code for FBQuant is publicly released. |
| Open Datasets | Yes | Following previous works [Frantar et al., 2022; Lin et al., 2024b], we employ 128 samples with a sequence length of 2048 in the subset of WikiText2 [Merity et al., 2016] training data for calibration. The perplexity results are tested on the WikiText2 validation set. The zero-shot evaluation is conducted using the open-source toolkit, i.e., Language Model Evaluation Harness [Gao et al., 2024], which has been utilized by other baselines. The evaluation datasets include Arc-Challenge [Clark et al., 2018], Arc-Easy [Clark et al., 2018], HellaSwag [Zellers et al., 2019], MMLU [Hendrycks et al., 2021], PIQA [Bisk et al., 2020], WinoGrande [Sakaguchi et al., 2019], and BoolQ [Wang et al., 2019]. |
| Dataset Splits | Yes | Following previous works [Frantar et al., 2022; Lin et al., 2024b], we employ 128 samples with a sequence length of 2048 in the subset of WikiText2 [Merity et al., 2016] training data for calibration. The perplexity results are tested on the WikiText2 validation set. |
| Hardware Specification | Yes | All experiments are conducted using A100 and RTX 3090 GPUs. Both the A100 and 3090 GPUs are utilized for optimizing the sub-branches, while only the 3090 GPU is used for testing latency, as it is commonly available for personal use. |
| Software Dependencies | No | The paper mentions Hugging Face and CUDA kernel but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | In the main results, we set the rank parameter to 128. The total number of optimization epochs is set to 20. A group size of 128 is used in all quantization methods. Sub-branches are integrated into all linear layers in LLMs, such as Query, Key, Value, and Out projections in Attention blocks, as well as Down, Gate, and Up projections in Feed-Forward Networks. |
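The calibration protocol quoted above (128 random samples of 2048 tokens drawn from the WikiText2 training split) can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `sample_calibration` and the flat-token-stream input are assumptions, and tokenization is left out.

```python
import numpy as np

def sample_calibration(token_ids, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random contiguous windows of seq_len tokens from a
    flat token stream, mirroring the common WikiText2 calibration setup.
    (Hypothetical helper; not taken from the FBQuant paper.)"""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(token_ids) - seq_len, size=n_samples)
    return np.stack([token_ids[s:s + seq_len] for s in starts])
```

In practice the token stream would come from tokenizing the WikiText2 training text; fixing the seed keeps the calibration set reproducible across runs.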
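The setup row describes group-wise quantization (group size 128) with a rank-128 sub-branch attached to every linear layer. A minimal NumPy sketch of that structure is below; it is an assumption-laden illustration, not FBQuant's implementation — `quantize_groupwise` is a generic asymmetric round-to-nearest baseline, `LowRankBranchLinear` only shows the shape of a low-rank correction branch, and the paper's feedback reconstruction (Algorithm 1) is not reproduced.

```python
import numpy as np

def quantize_groupwise(w, bits=3, group_size=128):
    """Asymmetric round-to-nearest quantization over groups of group_size
    consecutive weights; returns the dequantized (fake-quantized) weights.
    (Generic baseline scheme, not FBQuant's exact quantizer.)"""
    flat = w.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    qmax = 2 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((flat - lo) / scale), 0, qmax)
    return (q * scale + lo).reshape(w.shape)

class LowRankBranchLinear:
    """Quantized linear layer with an added rank-r sub-branch:
    y = x @ W_q.T + (x @ A) @ B. The branch (A, B) would be optimized to
    compensate quantization error; rank=128 matches the setup row above.
    (Hypothetical class for illustration only.)"""
    def __init__(self, weight, bits=3, group_size=128, rank=128, seed=0):
        rng = np.random.default_rng(seed)
        self.w_q = quantize_groupwise(weight, bits, group_size)
        out_f, in_f = weight.shape
        self.A = rng.standard_normal((in_f, rank)) * 0.01
        self.B = np.zeros((rank, out_f))  # zero-init: branch starts as a no-op

    def __call__(self, x):
        return x @ self.w_q.T + (x @ self.A) @ self.B
```

Per the setup row, such a branch would be attached to every linear projection (Query/Key/Value/Out in attention, Down/Gate/Up in the feed-forward network), with the branch parameters trained during the 20-epoch layer-wise reconstruction.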