LO-BCQ: Locally Optimal Block Clustered Quantization for 4-bit (W4A4) LLM Inference

Authors: Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform accuracy studies on GPT3 (Shoeybi et al., 2020) (1.3B, 8B, and 22B), Llama2 (Touvron et al., 2023) (7B and 70B), and Nemotron4 (Parmar et al., 2024) (15B and 340B) models. We evaluate PTQ inference accuracy on several downstream tasks including Wikitext-103 (Merity et al., 2016), MMLU (Hendrycks et al., 2021), and EleutherAI's LM evaluation harness (Gao et al., 2024). In this section, we present our accuracy studies on downstream tasks comparing LO-BCQ to various other block quantization proposals. Next, we present ablation studies on varying LO-BCQ configurations and our calibration methodology, namely universal vs. local.
Researcher Affiliation | Collaboration | Reena Elangovan, NVIDIA Corporation; Charbel Sakr, NVIDIA Corporation; Anand Raghunathan, Department of ECE, Purdue University; Brucek Khailany, NVIDIA Corporation
Pseudocode | Yes | Figure 3 presents an algorithm called Locally Optimal BCQ (LO-BCQ) to achieve this goal. LO-BCQ consists of two main steps: (i) updating block clusters with fixed per-cluster codebooks, and (ii) updating per-cluster codebooks with fixed block clusters. This algorithm begins at iteration 0 (initial condition) with a set of Nc initial codebooks {C_1^(0), ..., C_Nc^(0)} and unquantized operand blocks as inputs.
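The two alternating steps quoted above can be sketched as a k-means-style loop. This is an illustrative reconstruction, not the paper's Figure 3: the use of a 1D Lloyd refit for step (ii), the function names, and the iteration counts are all assumptions.

```python
import numpy as np

def quantize_block(block, codebook):
    """Map each value in a block to its nearest codebook entry."""
    idx = np.argmin(np.abs(block[:, None] - codebook[None, :]), axis=1)
    return codebook[idx]

def block_mse(block, codebook):
    """Reconstruction MSE of one block under a given codebook."""
    return np.mean((block - quantize_block(block, codebook)) ** 2)

def lo_bcq_sketch(blocks, codebooks, n_iters=10):
    """Alternating optimization in the spirit of LO-BCQ:
    (i)  update block clusters with fixed per-cluster codebooks,
    (ii) update per-cluster codebooks with fixed block clusters
         (here via a few 1D Lloyd iterations on the pooled values)."""
    blocks = np.asarray(blocks, dtype=np.float64)
    codebooks = [np.sort(np.asarray(c, dtype=np.float64)) for c in codebooks]
    for _ in range(n_iters):
        # Step (i): assign each block to the codebook minimizing its MSE.
        assign = np.array([
            np.argmin([block_mse(b, c) for c in codebooks]) for b in blocks
        ])
        # Step (ii): re-fit each codebook to the blocks assigned to it.
        for k, c in enumerate(codebooks):
            vals = blocks[assign == k].ravel()
            if vals.size == 0:
                continue  # leave an unused codebook untouched
            for _ in range(5):  # a few Lloyd iterations per codebook
                idx = np.argmin(np.abs(vals[:, None] - c[None, :]), axis=1)
                for j in range(c.size):
                    if np.any(idx == j):
                        c[j] = vals[idx == j].mean()
                c = np.sort(c)
            codebooks[k] = c
    return codebooks, assign
```

Both steps are non-increasing in total reconstruction MSE, so the loop converges in the same way ordinary k-means does.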
Open Source Code | No | No explicit statement or link to open-source code for the methodology is provided in the paper.
Open Datasets | Yes | We evaluate PTQ inference accuracy on several downstream tasks including Wikitext-103 (Merity et al., 2016), MMLU (Hendrycks et al., 2021), and EleutherAI's LM evaluation harness (Gao et al., 2024).
Dataset Splits | No | The paper mentions evaluating on "0-shot LM evaluation harness tasks" and "5-shot MMLU tasks," and using "calibration data," but does not explicitly provide specific percentages, absolute sample counts, or detailed methodology for dataset splits for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing specifications used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers used to replicate the experiment.
Experiment Setup | Yes | In our experiments, we perform this calibration on one batch of activations from the training data of the GPT3-126M model and the Wikitext-103 dataset. We freeze these optimal codebooks across operands and models during all of our accuracy evaluations. Further, we represent each entry of the codebooks as a 6-bit integer. That is, once decoded, the inner product computations with a block array during inference can be performed at 6-bit precision. In this paper, we assume Bs = 8 and the data format F is floating point E4M3. Further, each codebook entry is a 6-bit integer (i.e., Bc = 6) and we vary Nc between 2 and 16, Lb between 2 and 8, and LA between 16 and 128 to obtain various LO-BCQ configurations.
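To make the quoted setup concrete, the following sketch quantizes one block of Bs = 8 values: the block is scaled by an E4M3-style factor, then the codebook (out of Nc) whose 6-bit integer entries give the lowest reconstruction MSE is selected. The E4M3 rounding, the assumed exponent range, and the 6-bit signed range of ±31 are simplifying assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def round_to_e4m3(x):
    """Crudely snap a positive scale to an E4M3-like grid
    (4 exponent bits, 3 mantissa bits); a simplification of format F."""
    if x <= 0:
        return 1.0
    e = np.clip(np.floor(np.log2(x)), -6, 8)  # assumed exponent range
    m = np.round(x / 2.0**e * 8) / 8          # 3 mantissa bits
    return m * 2.0**e

def quantize_block_bcq(block, codebooks):
    """Quantize one block (Bs = 8 values): scale it toward the 6-bit
    integer codebook range, then pick the codebook with minimum MSE."""
    block = np.asarray(block, dtype=np.float64)
    scale = round_to_e4m3(np.max(np.abs(block)) / 31.0)  # 6-bit range assumed
    best = None
    for k, cb in enumerate(codebooks):
        idx = np.argmin(np.abs(block[:, None] - scale * cb[None, :]), axis=1)
        recon = scale * cb[idx]
        mse = np.mean((block - recon) ** 2)
        if best is None or mse < best[0]:
            best = (mse, k, idx, recon)
    mse, k, idx, recon = best
    return k, idx, scale, recon
```

At inference time only the codebook selector `k`, the per-value indices `idx`, and the block scale would need to be stored; the decoded 6-bit integer entries then feed the 6-bit inner-product computation described in the quote.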