OAC: Output-adaptive Calibration for Accurate Post-training Quantization

Authors: Ali Edalati, Alireza Ghaffari, Mahsa Ghazvini Nejad, Lu Hou, Boxing Chen, Masoud Asgharian, Vahid Partovi Nia

AAAI 2025

Reproducibility variable | Result | LLM response
Research Type | Experimental | We investigate various language models with different sizes including OPT (Zhang et al. 2022), LLaMA (Touvron et al. 2023a), and LLaMA 2 (Touvron et al. 2023b) families. The calibration set comprises 128 sequences of 2048 tokens. To evaluate the performance of the quantized models on language modeling tasks, we report their perplexity on C4 (Raffel et al. 2020) and WikiText2 (Merity et al. 2017). Also, the Language Model Evaluation Harness (LMEH) (Gao et al. 2023) is utilized for evaluating the reasoning abilities of the quantized models. We report the zero-shot accuracy on WinoGrande (Sakaguchi et al. 2021), PiQA (Tata and Patel 2003), HellaSwag (Zellers et al. 2019), ARC-easy, and ARC-challenge (Clark et al. 2018), in addition to the five-shot exact match on the GSM8K (Cobbe et al. 2021) dataset. In our experimental results, we compare our method with the latest state-of-the-art PTQ methods...
Researcher Affiliation | Collaboration | Ali Edalati1, Alireza Ghaffari1,2, Mahsa Ghazvini Nejad1, Lu Hou1, Boxing Chen1, Masoud Asgharian2, Vahid Partovi Nia1; 1Huawei Noah's Ark Lab; 2Department of Mathematics and Statistics, McGill University
Pseudocode | Yes | Algorithm 1: OAC Pipeline
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology, nor does it include a direct link to a code repository. It mentions referring to an appendix in a pre-print for details, but this is not a concrete code release.
Open Datasets | Yes | To evaluate the performance of the quantized models on language modeling tasks, we report their perplexity on C4 (Raffel et al. 2020) and WikiText2 (Merity et al. 2017). Also, the Language Model Evaluation Harness (LMEH) (Gao et al. 2023) is utilized for evaluating the reasoning abilities of the quantized models. We report the zero-shot accuracy on WinoGrande (Sakaguchi et al. 2021), PiQA (Tata and Patel 2003), HellaSwag (Zellers et al. 2019), ARC-easy, and ARC-challenge (Clark et al. 2018), in addition to the five-shot exact match on the GSM8K (Cobbe et al. 2021) dataset.
Dataset Splits | Yes | The calibration set comprises 128 sequences of 2048 tokens. To evaluate the performance of the quantized models on language modeling tasks, we report their perplexity on C4 (Raffel et al. 2020) and WikiText2 (Merity et al. 2017).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions 'resource-limited machines' in a general context.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | The calibration set comprises 128 sequences of 2048 tokens. To develop a complete PTQ pipeline, we integrate a Hessian-based calibration technique with our proposed method. The OAC pipeline is described in Algorithm 1. ... Most of the Hessian-based calibration techniques can be employed in this phase. However, to apply OAC for accurate 2-bit PTQ of LLMs, the following steps from SpQR (Dettmers et al. 2024) are integrated into our method. The salient weights are detected and isolated using equation (4)...
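The evaluation protocol quoted above (a calibration set of 128 sequences of 2048 tokens, with quality reported as perplexity on C4 and WikiText2) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `build_calibration_set` and the synthetic token stream are assumptions, and `perplexity` simply applies the standard definition, the exponential of the mean per-token negative log-likelihood.

```python
import numpy as np

def build_calibration_set(token_stream, num_seqs=128, seq_len=2048, seed=0):
    """Sample num_seqs windows of seq_len tokens from a long token stream
    (the paper uses 128 sequences of 2048 tokens). Illustrative only."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(token_stream) - seq_len + 1, size=num_seqs)
    return np.stack([token_stream[s:s + seq_len] for s in starts])

def perplexity(nll_per_token):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return float(np.exp(np.mean(nll_per_token)))
```

In practice the token stream would come from the C4 training split and the negative log-likelihoods from the quantized model's forward pass; both are stubbed out here to keep the sketch self-contained.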
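The salient-weight isolation step that the paper borrows from SpQR (its equation (4) is referenced but not reproduced in this report) can be illustrated with an OBS-style sensitivity score: weights whose quantization error, weighted by the inverse-Hessian diagonal, is largest are kept in higher precision. The quantizer, score, and threshold below are assumptions chosen for illustration, not the paper's exact formulation.

```python
import numpy as np

def rtn_quantize(w, bits=2):
    """Round-to-nearest uniform symmetric quantizer (illustrative baseline)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit quantization
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def salient_weight_mask(W, h_inv_diag, outlier_frac=0.01):
    """Mark the top outlier_frac of weights as salient, scoring each weight by
    its squared RTN quantization error divided by the corresponding diagonal
    entry of the inverse Hessian (an OBS-style sensitivity)."""
    err2 = (W - rtn_quantize(W)) ** 2
    sensitivity = err2 / h_inv_diag[None, :]
    threshold = np.quantile(sensitivity, 1.0 - outlier_frac)
    return sensitivity > threshold      # True = keep in high precision
```

The masked weights would then be excluded from 2-bit quantization and stored separately, while the Hessian-based calibration updates the remaining weights.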