LO-BCQ: Locally Optimal Block Clustered Quantization for 4-bit (W4A4) LLM Inference

Authors: Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform accuracy studies on GPT3 (Shoeybi et al., 2020) (1.3B, 8B, and 22B), Llama2 (Touvron et al., 2023) (7B and 70B), and Nemotron4 (Parmar et al., 2024) (15B and 340B) models. We evaluate PTQ inference accuracy on several downstream tasks including Wikitext-103 (Merity et al., 2016), MMLU (Hendrycks et al., 2021), and EleutherAI's LM evaluation harness (Gao et al., 2024). In this section, we present our accuracy studies on downstream tasks comparing LO-BCQ to various other block quantization proposals. Next, we present ablation studies on varying LO-BCQ configurations and our calibration methodology, namely universal vs. local.
Researcher Affiliation | Collaboration | Reena Elangovan, NVIDIA Corporation; Charbel Sakr, NVIDIA Corporation; Anand Raghunathan, Department of ECE, Purdue University; Brucek Khailany, NVIDIA Corporation
Pseudocode | Yes | Figure 3 presents an algorithm called Locally Optimal BCQ (LO-BCQ) to achieve this goal. LO-BCQ consists of two main steps: (i) updating block clusters with fixed per-cluster codebooks, and (ii) updating per-cluster codebooks with fixed block clusters. This algorithm begins at iteration 0 (initial condition) with a set of Nc initial codebooks {C_1^(0), ..., C_Nc^(0)} and unquantized operand blocks as inputs.
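The two alternating steps quoted above can be sketched as a k-means-style loop. This is an illustrative reconstruction, not the paper's Figure 3: the use of a 1D Lloyd refit for step (ii), the function names, and the iteration counts are all assumptions.

```python
import numpy as np

def quantize_block(block, codebook):
    """Map each value in a block to its nearest codebook entry."""
    idx = np.argmin(np.abs(block[:, None] - codebook[None, :]), axis=1)
    return codebook[idx]

def block_mse(block, codebook):
    """Reconstruction MSE of one block under a given codebook."""
    return np.mean((block - quantize_block(block, codebook)) ** 2)

def lo_bcq_sketch(blocks, codebooks, n_iters=10):
    """Alternating optimization in the spirit of LO-BCQ:
    (i)  update block clusters with fixed per-cluster codebooks,
    (ii) update per-cluster codebooks with fixed block clusters
         (here via a few 1D Lloyd iterations on the pooled values)."""
    blocks = np.asarray(blocks, dtype=np.float64)
    codebooks = [np.sort(np.asarray(c, dtype=np.float64)) for c in codebooks]
    for _ in range(n_iters):
        # Step (i): assign each block to the codebook minimizing its MSE.
        assign = np.array([
            np.argmin([block_mse(b, c) for c in codebooks]) for b in blocks
        ])
        # Step (ii): re-fit each codebook to the blocks assigned to it.
        for k, c in enumerate(codebooks):
            vals = blocks[assign == k].ravel()
            if vals.size == 0:
                continue  # leave an unused codebook untouched
            for _ in range(5):  # a few Lloyd iterations per codebook
                idx = np.argmin(np.abs(vals[:, None] - c[None, :]), axis=1)
                for j in range(c.size):
                    if np.any(idx == j):
                        c[j] = vals[idx == j].mean()
                c = np.sort(c)
            codebooks[k] = c
    return codebooks, assign
```

Both steps are non-increasing in total reconstruction MSE, so the loop converges in the same way ordinary k-means does.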
Open Source Code | No | No explicit statement or link to open-source code for the methodology is provided in the paper.
Open Datasets | Yes | We evaluate PTQ inference accuracy on several downstream tasks including Wikitext-103 (Merity et al., 2016), MMLU (Hendrycks et al., 2021), and EleutherAI's LM evaluation harness (Gao et al., 2024).
Dataset Splits | No | The paper mentions evaluating on "0-shot LM evaluation harness tasks" and "5-shot MMLU tasks," and using "calibration data," but does not explicitly provide specific percentages, absolute sample counts, or detailed methodology for dataset splits for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing specifications used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers used to replicate the experiment.
Experiment Setup | Yes | In our experiments, we perform this calibration on one batch of activations from the training data of the GPT3-126M model and the Wikitext-103 dataset. We freeze these optimal codebooks across operands and models during all of our accuracy evaluations. Further, we represent each entry of the codebooks as a 6-bit integer. That is, once decoded, the inner product computations with a block array during inference can be performed at 6-bit precision. In this paper, we assume Bs = 8 and the data format F is floating point E4M3. Further, each codebook entry is a 6-bit integer (i.e., Bc = 6) and we vary Nc between 2 and 16, Lb between 2 and 8, and LA between 16 and 128 to obtain various LO-BCQ configurations.
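To make the quoted setup concrete, the following sketch quantizes one block of Bs = 8 values: the block is scaled by an E4M3-style factor, then the codebook (out of Nc) whose 6-bit integer entries give the lowest reconstruction MSE is selected. The E4M3 rounding, the assumed exponent range, and the 6-bit signed range of ±31 are simplifying assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def round_to_e4m3(x):
    """Crudely snap a positive scale to an E4M3-like grid
    (4 exponent bits, 3 mantissa bits); a simplification of format F."""
    if x <= 0:
        return 1.0
    e = np.clip(np.floor(np.log2(x)), -6, 8)  # assumed exponent range
    m = np.round(x / 2.0**e * 8) / 8          # 3 mantissa bits
    return m * 2.0**e

def quantize_block_bcq(block, codebooks):
    """Quantize one block (Bs = 8 values): scale it toward the 6-bit
    integer codebook range, then pick the codebook with minimum MSE."""
    block = np.asarray(block, dtype=np.float64)
    scale = round_to_e4m3(np.max(np.abs(block)) / 31.0)  # 6-bit range assumed
    best = None
    for k, cb in enumerate(codebooks):
        idx = np.argmin(np.abs(block[:, None] - scale * cb[None, :]), axis=1)
        recon = scale * cb[idx]
        mse = np.mean((block - recon) ** 2)
        if best is None or mse < best[0]:
            best = (mse, k, idx, recon)
    mse, k, idx, recon = best
    return k, idx, scale, recon
```

At inference time only the codebook selector `k`, the per-value indices `idx`, and the block scale would need to be stored; the decoded 6-bit integer entries then feed the 6-bit inner-product computation described in the quote.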