OAC: Output-adaptive Calibration for Accurate Post-training Quantization

Authors: Ali Edalati, Alireza Ghaffari, Mahsa Ghazvini Nejad, Lu Hou, Boxing Chen, Masoud Asgharian, Vahid Partovi Nia

AAAI 2025

Reproducibility variable | Result | LLM response
Research Type | Experimental | We investigate various language models with different sizes including OPT (Zhang et al. 2022), LLaMA (Touvron et al. 2023a), and LLaMA 2 (Touvron et al. 2023b) families. The calibration set comprises 128 sequences of 2048 tokens. To evaluate the performance of the quantized models on language modeling tasks, we report their perplexity on C4 (Raffel et al. 2020) and WikiText2 (Merity et al. 2017). Also, the Language Model Evaluation Harness (LMEH) (Gao et al. 2023) is utilized for evaluating the reasoning abilities of the quantized models. We report the zero-shot accuracy on WinoGrande (Sakaguchi et al. 2021), PiQA (Tata and Patel 2003), HellaSwag (Zellers et al. 2019), ARC-easy, and ARC-challenge (Clark et al. 2018), in addition to the five-shot exact match on the GSM8K (Cobbe et al. 2021) dataset. In our experimental results, we compare our method with the latest state-of-the-art PTQ methods...
Researcher Affiliation | Collaboration | Ali Edalati1, Alireza Ghaffari1,2, Mahsa Ghazvini Nejad1, Lu Hou1, Boxing Chen1, Masoud Asgharian2, Vahid Partovi Nia1; 1Huawei Noah's Ark Lab; 2Department of Mathematics and Statistics, McGill University
Pseudocode | Yes | Algorithm 1: OAC Pipeline
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology, nor does it include a direct link to a code repository. It mentions referring to an appendix in a pre-print for details, but this is not a concrete code release.
Open Datasets | Yes | To evaluate the performance of the quantized models on language modeling tasks, we report their perplexity on C4 (Raffel et al. 2020) and WikiText2 (Merity et al. 2017). Also, the Language Model Evaluation Harness (LMEH) (Gao et al. 2023) is utilized for evaluating the reasoning abilities of the quantized models. We report the zero-shot accuracy on WinoGrande (Sakaguchi et al. 2021), PiQA (Tata and Patel 2003), HellaSwag (Zellers et al. 2019), ARC-easy, and ARC-challenge (Clark et al. 2018), in addition to the five-shot exact match on the GSM8K (Cobbe et al. 2021) dataset.
Dataset Splits | Yes | The calibration set comprises 128 sequences of 2048 tokens. To evaluate the performance of the quantized models on language modeling tasks, we report their perplexity on C4 (Raffel et al. 2020) and WikiText2 (Merity et al. 2017).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions 'resource-limited machines' in a general context.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | The calibration set comprises 128 sequences of 2048 tokens. To develop a complete PTQ pipeline, we integrate a Hessian-based calibration technique with our proposed method. The OAC pipeline is described in Algorithm 1. ... Most of the Hessian-based calibration techniques can be employed in this phase. However, to apply OAC for accurate 2-bit PTQ of LLMs, the following steps from SpQR (Dettmers et al. 2024) are integrated into our method. The salient weights are detected and isolated using equation (4)...
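The evaluation protocol quoted above (a calibration set of 128 sequences of 2048 tokens, with quality reported as perplexity on C4 and WikiText2) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `build_calibration_set` and the synthetic token stream are assumptions, and `perplexity` simply applies the standard definition, the exponential of the mean per-token negative log-likelihood.

```python
import numpy as np

def build_calibration_set(token_stream, num_seqs=128, seq_len=2048, seed=0):
    """Sample num_seqs windows of seq_len tokens from a long token stream
    (the paper uses 128 sequences of 2048 tokens). Illustrative only."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(token_stream) - seq_len + 1, size=num_seqs)
    return np.stack([token_stream[s:s + seq_len] for s in starts])

def perplexity(nll_per_token):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return float(np.exp(np.mean(nll_per_token)))
```

In practice the token stream would come from the C4 training split and the negative log-likelihoods from the quantized model's forward pass; both are stubbed out here to keep the sketch self-contained.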
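The salient-weight isolation step that the paper borrows from SpQR (its equation (4) is referenced but not reproduced in this report) can be illustrated with an OBS-style sensitivity score: weights whose quantization error, weighted by the inverse-Hessian diagonal, is largest are kept in higher precision. The quantizer, score, and threshold below are assumptions chosen for illustration, not the paper's exact formulation.

```python
import numpy as np

def rtn_quantize(w, bits=2):
    """Round-to-nearest uniform symmetric quantizer (illustrative baseline)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit quantization
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def salient_weight_mask(W, h_inv_diag, outlier_frac=0.01):
    """Mark the top outlier_frac of weights as salient, scoring each weight by
    its squared RTN quantization error divided by the corresponding diagonal
    entry of the inverse Hessian (an OBS-style sensitivity)."""
    err2 = (W - rtn_quantize(W)) ** 2
    sensitivity = err2 / h_inv_diag[None, :]
    threshold = np.quantile(sensitivity, 1.0 - outlier_frac)
    return sensitivity > threshold      # True = keep in high precision
```

The masked weights would then be excluded from 2-bit quantization and stored separately, while the Hessian-based calibration updates the remaining weights.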