FBQuant: FeedBack Quantization for Large Language Models

Authors: Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%. In this section, we present the experimental setup of models, baselines, datasets, metrics and implementation details in Sec. 5.1. Then, we demonstrate the perplexity and zero-shot accuracy of various quantization methods in Sec. 5.2, followed by the performance of instruction-tuned models and the wall-clock latency on real devices.
Researcher Affiliation | Academia | (1) School of Electronic Science and Engineering, Nanjing University; (2) Interdisciplinary Research Center for Future Intelligent Chips (Chip-X), Nanjing University, Suzhou. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Layer-wise Reconstruction by FBQuant
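The paper's Algorithm 1 itself is not reproduced in this report. As a rough illustration of what layer-wise reconstruction with low-rank sub-branches can look like, here is a minimal NumPy sketch: group-wise 3-bit quantization of a weight matrix, plus an SVD-fitted low-rank correction to the quantization error. This is a generic construction under stated assumptions, not the authors' FBQuant method; the function names and the data-free SVD fit are illustrative choices.

```python
import numpy as np

def quantize_groupwise(w, bits=3, group=128):
    """Symmetric per-group round-to-nearest (a generic scheme, not FBQuant's)."""
    qmax = 2 ** (bits - 1) - 1           # 3 bits -> integer levels in [-4, 3]
    flat = w.reshape(-1, group)          # one scale per contiguous group of 128
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0            # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(flat / scales), -qmax - 1, qmax)
    return (q * scales).reshape(w.shape)

def reconstruct_layer(w, rank=32):
    """Quantize w, then fit a rank-`rank` correction A @ B to the quantization
    error via SVD (data-free here; a data-aware fit would weight the error by
    calibration activations, as layer-wise reconstruction methods do)."""
    qw = quantize_groupwise(w)
    err = w - qw                          # error the sub-branch should absorb
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]            # (out_dim, rank)
    b = vt[:rank]                         # (rank, in_dim)
    return qw, a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal((256, 64)).astype(np.float32)   # toy activations
qw, a, b = reconstruct_layer(w)
base = float(np.mean((w @ x - qw @ x) ** 2))
corrected = float(np.mean((w @ x - (qw + a @ b) @ x) ** 2))
```

Because the low-rank term removes the top singular components of the error, the corrected output error is strictly smaller than the plain-quantization error on any input.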
Open Source Code | No | The paper does not explicitly provide a link to the source code or state that the code for FBQuant is publicly released.
Open Datasets | Yes | Following previous works [Frantar et al., 2022; Lin et al., 2024b], we employ 128 samples with a sequence length of 2048 in the subset of WikiText2 [Merity et al., 2016] training data for calibration. The perplexity results are tested on the WikiText2 validation set. The zero-shot evaluation is conducted using the open-source toolkit, i.e., Language Model Evaluation Harness [Gao et al., 2024], which has been utilized by other baselines. The evaluation datasets include ARC-Challenge [Clark et al., 2018], ARC-Easy [Clark et al., 2018], HellaSwag [Zellers et al., 2019], MMLU [Hendrycks et al., 2021], PIQA [Bisk et al., 2020], WinoGrande [Sakaguchi et al., 2019], and BoolQ [Wang et al., 2019].
Dataset Splits | Yes | Following previous works [Frantar et al., 2022; Lin et al., 2024b], we employ 128 samples with a sequence length of 2048 in the subset of WikiText2 [Merity et al., 2016] training data for calibration. The perplexity results are tested on the WikiText2 validation set.
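The calibration protocol quoted above (128 samples, sequence length 2048, drawn from the training split) can be sketched in a few lines. A synthetic token stream stands in for the tokenized WikiText2 corpus here, since loading the real dataset would require the `datasets` library and network access; the function name is an illustrative assumption.

```python
import numpy as np

def draw_calibration_set(token_stream, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random contiguous windows of seq_len tokens each."""
    rng = np.random.default_rng(seed)
    max_start = len(token_stream) - seq_len
    starts = rng.integers(0, max_start, size=n_samples)
    return np.stack([token_stream[s:s + seq_len] for s in starts])

# stand-in for a tokenized WikiText2 training split (synthetic token IDs)
corpus = np.arange(1_000_000, dtype=np.int64)
calib = draw_calibration_set(corpus)   # shape (128, 2048)
```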
Hardware Specification | Yes | All experiments are conducted using A100 and RTX 3090 GPUs. Both the A100 and 3090 GPUs are utilized for optimizing the sub-branches, while only the 3090 GPU is used for testing latency, as it is commonly available for personal use.
Software Dependencies | No | The paper mentions Hugging Face and a CUDA kernel but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | In the main results, we set the rank parameter to 128. The total number of optimization epochs is set to 20. A group size of 128 is used in all quantization methods. Sub-branches are integrated into all linear layers in LLMs, such as Query, Key, Value, and Out projections in Attention blocks, as well as Down, Gate, and Up projections in Feed-Forward Networks.
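The setup above says a rank-128 sub-branch is attached to every linear projection in each block. A minimal sketch of that wiring, assuming Hugging Face LLaMA-style module names (`q_proj`, `down_proj`, etc. — these names are an assumption, not quoted from the paper), with each branch stored as an (A, B) factor pair:

```python
import numpy as np

# Linear projections that receive a sub-branch, per the quoted setup
# (names follow the Hugging Face LLaMA convention; illustrative only)
TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj",
           "down_proj", "gate_proj", "up_proj"}

def attach_subbranches(layers, rank=128, seed=0):
    """For each targeted linear weight, create a rank-`rank` (A, B) pair.
    B starts at zero so the branch initially leaves layer outputs unchanged."""
    rng = np.random.default_rng(seed)
    branches = {}
    for name, w in layers.items():
        if name.split(".")[-1] in TARGETS:
            out_dim, in_dim = w.shape
            a = rng.standard_normal((out_dim, rank)).astype(w.dtype) * 0.01
            b = np.zeros((rank, in_dim), dtype=w.dtype)
            branches[name] = (a, b)
    return branches

# toy single-block "model": name -> weight matrix (shapes illustrative)
layers = {
    "attn.q_proj": np.zeros((512, 512), dtype=np.float32),
    "mlp.down_proj": np.zeros((512, 1376), dtype=np.float32),
    "norm.weight": np.zeros((512,), dtype=np.float32),  # not a target
}
branches = attach_subbranches(layers)
```

Zero-initializing B is a common choice for low-rank adapters: the quantized model's behavior is unchanged at the start of the 20-epoch optimization, and the branch learns only the correction.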