Treasures in Discarded Weights for LLM Quantization
Authors: Hao Yu, Yang Zhou, Bohua Chen, Zelan Yang, Shen Li, Yong Li, Jianxin Wu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of our approach on widely used benchmark datasets for LLMs. A large number of experiments have verified the effectiveness of our method. In three typical LLM families, our framework can be combined with various PTQ and QAT algorithms and improve the accuracies of low-bit quantization models, which demonstrates DWR's broad applicability and effectiveness in different scenarios. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Novel Software Technology, Nanjing University 2School of Artificial Intelligence, Nanjing University 3Alibaba Cloud Computing |
| Pseudocode | Yes | Algorithm 1: The DWR Framework. Input: the original large language model M and its low-bit PTQ model Mq with scales and zero points; the calibration dataset C. Output: the quantization model after compensation. 1: Calculate Mq's perplexity po on C. 2: for each layer in Mq do 3: Set pn (the current best perplexity) as po, and kn (the current best k value) as zero. 4: Calculate discarded weights D for each FC in this transformer layer. 5: Pre-design a set of search space dimensions. 6: for dimension k in the search space do 7: Uniformly update all FC layers in the model layer by Equation 4, 5, or 10 using this k value. 8: Calculate Mq's perplexity p with the updated FC layers on the calibration dataset C. 9: if p < pn then 10: Update pn with p and kn with k. 11: end if 12: end for 13: if pn < po then 14: Based on dimension kn and Equation 4, 5, or 10, update Mq's layer. 15: Set po as pn. 16: end if 17: end for |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate the perplexity on the WikiText2 (Stephen et al. 2017) and C4 (Raffel et al. 2020) datasets. We further assess the zero-shot common sense question answering (QA) ability on tasks covering SIQA (Sap et al. 2019), HellaSwag (Zellers et al. 2019), PIQA (Bisk et al. 2020), WinoGrande (Sakaguchi et al. 2021), ARC (Clark et al. 2018), BoolQ (Clark et al. 2019), and OpenBookQA (Mihaylov et al. 2018). We also evaluate both the zero-shot and five-shot performance of the LLMs on the massively multitask language understanding (MMLU) benchmark (Hendrycks et al. 2021). [...] we randomly sample 23k data from the Flan v2 (Longpre et al. 2023) dataset to fine-tune the quantization models. |
| Dataset Splits | Yes | GPTQ and OmniQuant take 128 samples from the C4 and WikiText2 datasets as calibration sets respectively, and each sample is 2048 tokens long. We use the same calibration set when performing DWR after these PTQ algorithms. [...] we randomly sample 23k data from the Flan v2 (Longpre et al. 2023) dataset to fine-tune the quantization models. |
| Hardware Specification | Yes | We run LLaMA2-7B & 13B on 8 Tesla V100 GPUs and LLaMA2-70B on 4 80GB A100 GPUs. |
| Software Dependencies | No | All experiments are conducted with PyTorch. |
| Experiment Setup | Yes | All training hyperparameters are the same as in the original QA-LoRA paper, and we randomly sample 23k data from the Flan v2 (Longpre et al. 2023) dataset to fine-tune the quantization models. For the selection range of dimension k, we set the search interval to 512 on the 7B models, 1024 on the 13B and 30B models, and 2048 on the 70B model. [...] In the original settings of BLOOM-7B1 with INT4 quantization, we calculate perplexity by 128 samples, skip the first 1/6 blocks, and set the searching interval to 512. |
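The greedy per-layer search in the quoted Algorithm 1 can be sketched as plain Python. This is a minimal toy sketch, not the authors' implementation: `perplexity_fn` and `update_fn` are hypothetical stand-ins for the calibration-set perplexity and for the compensation updates of Equations 4/5/10, which the excerpt above does not spell out.

```python
def dwr_search(layers, perplexity_fn, update_fn, search_space):
    """Greedy per-layer search over the compensation dimension k (Algorithm 1 sketch).

    layers        : list of mutable per-layer states (toy stand-ins for FC weights)
    perplexity_fn : callable(layers) -> float, perplexity on a calibration set
    update_fn     : callable(layer_state, k) -> updated state (Eq. 4/5/10 stand-in)
    search_space  : iterable of candidate k values
    """
    p_o = perplexity_fn(layers)  # baseline perplexity of the quantized model
    for i in range(len(layers)):
        p_n, k_n = p_o, None     # current best perplexity / best k for this layer
        original = layers[i]
        for k in search_space:
            layers[i] = update_fn(original, k)  # tentatively compensate this layer
            p = perplexity_fn(layers)
            if p < p_n:
                p_n, k_n = p, k
        if k_n is not None and p_n < p_o:
            layers[i] = update_fn(original, k_n)  # keep the best update only if it helps
            p_o = p_n
        else:
            layers[i] = original  # revert: no candidate k improved perplexity
    return layers, p_o


# Toy usage: each "layer" is a scalar error, "perplexity" is the squared error,
# and a larger k applies a stronger (hypothetical) compensation.
layers = [3.0, -2.0]
ppl = lambda ls: sum(x * x for x in ls)
upd = lambda x, k: x * (1 - 0.1 * k)
best_layers, best_ppl = dwr_search(layers, ppl, upd, search_space=range(0, 11))
```

The revert step mirrors lines 13-16 of the pseudocode: a layer is only rewritten when the best candidate strictly lowers the calibration perplexity, so the search can never make the model worse on the calibration set.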