GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

Authors: Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we demonstrate the versatility and effectiveness of our method across various quantization schemes. We first explore different quantization scenarios and identify the formats best suited to each setting, ultimately focusing on three main approaches: weight-only scalar, weight-only vector, and weight-and-activation quantization. By integrating the GuidedQuant objective into existing methods, our results consistently achieve state-of-the-art PTQ performance. Refer to Appendix D.2 for details on how we incorporate the GuidedQuant objective into existing methods. Additional experiments and details, including the overall cost of our method, the effect of the number of groups g, and the end-to-end finetuning results, are provided in Appendix E."
Researcher Affiliation | Collaboration | "1 Department of Computer Science and Engineering, Seoul National University; 2 Neural Processing Research Center; 3 Samsung AI Lab, Montreal; 4 Google."
Pseudocode | Yes | "Algorithm 1 GuidedQuant. Input: layer-wise quantization algorithm Q, number of groups g, number of linear layers L. Algorithm 2 LNQ. Input: Hessian of the objective H ∈ R^{d_in × d_in}, input weight W ∈ R^{d_in × d_out}, initial assignment P(j) ∈ R^{d_in × m} for each output channel j."
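The quoted pseudocode only lists the algorithm inputs, so the overall control flow may be easier to see in code. The sketch below is a hypothetical reading of Algorithm 1: split each layer's output channels into g groups, weight one per-group Hessian by that group's averaged squared end-loss gradients, and hand the result to an off-the-shelf layer-wise quantizer Q. The function name `guidedquant_sketch` and the exact guidance weighting are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def guidedquant_sketch(weights, inputs, grads, Q, g):
    """Hypothetical sketch of Algorithm 1 (GuidedQuant).

    weights: list of (d_in, d_out) weight matrices, one per linear layer
    inputs:  list of (n, d_in) calibration activations
    grads:   list of (n, d_out) end-loss output gradients
    Q:       layer-wise quantizer taking (W, per-group Hessians, groups)
    g:       number of output-channel groups
    """
    quantized = []
    for W, X, G in zip(weights, inputs, grads):
        d_out = W.shape[1]
        groups = np.array_split(np.arange(d_out), g)
        H_groups = []
        for idx in groups:
            # Scale the input Hessian by the group's averaged squared
            # end-loss gradients (the "end loss guidance" weighting).
            s = np.mean(G[:, idx] ** 2)
            H_groups.append(s * (X.T @ X))
        quantized.append(Q(W, H_groups, groups))
    return quantized
```

A trivial rounding quantizer can stand in for Q (e.g. LNQ or QTIP in the paper) to exercise the loop.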
Open Source Code | Yes | "We release the code at https://github.com/snu-mllab/GuidedQuant."
Open Datasets | Yes | "We use the RedPajama dataset (Computer, 2023) for calibration, following prior work (Egiazarian et al., 2024; Tseng et al., 2024a;b), with 1024 sentences, each containing 4096 tokens. We report perplexity on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) validation sets."
Dataset Splits | Yes | "We use the RedPajama dataset (Computer, 2023) for calibration, following prior work (Egiazarian et al., 2024; Tseng et al., 2024a;b), with 1024 sentences, each containing 4096 tokens. For weight-and-activation quantization methods... we use the WikiText2 dataset (Merity et al., 2016) for calibration, with 128 sentences, each containing 2048 tokens (Ashkboos et al., 2024; Liu et al., 2024). Our finetuning setup uses training data from the RedPajama dataset (Computer, 2023), with a context size of 4096 tokens, a batch size of 128 sentences, and finetuning for 128 steps in 2-bit quantization and 32 steps in 3-bit quantization."
Hardware Specification | Yes | "Table 2. End-to-end inference throughput of Llama-2 models on RTX 4090 GPU. Table 8. Total GPU cost incurred during the quantization process for LNQ and QTIP, both with and without GuidedQuant, across various group sizes g. We specify the number and type of GPU used in the parentheses. R6A denotes the RTX 6000 Ada GPU. Table 9. Total GPU cost and disk usage incurred during the gradient and Hessian caching processes... R6A and A100 denote the RTX 6000 Ada GPU and the A100 GPU, respectively. To demonstrate the speedup achieved by our optimization techniques for the CD algorithm, we report the quantization time for quantizing the Llama-2-7B model into 4-bit precision on a single RTX 6000 Ada GPU. Quantizing Llama-2-70B using our LNQ algorithm takes less than three hours when using 8 RTX 6000 Ada GPUs. Throughput is measured on an RTX 3090 GPU as the average of 5 runs, with standard deviation in parentheses."
Software Dependencies | Yes | "after integrating the kernels into a PyTorch-based inference pipeline optimized with the torch.compile function (Ansel et al., 2024; Gray, 2019). We evaluate on these tasks using version 0.4.3 of the lm-evaluation-harness library (Gao et al., 2024)."
Experiment Setup | Yes | "For weight-only quantization experiments, we set g = 4 for Llama-2-7B and Llama-2-13B, and g = 2 for Llama-2-70B. For weight-and-activation quantization experiments, we set g = 1. In our implementation, we scale the gradients by a large constant (we used 10^3 in all experiments) while computing the averaged Hessians H_k to prevent underflow. For Llama-2-7B and Llama-2-13B, we use T = 2 and K = 4, and for Llama-2-70B, we use T = 1 and K = 4 in all the experiments. To address this, we add a small constant λ = 10^-7 to the diagonal of the matrix, as commonly done in prior work (Frantar & Alistarh, 2022; Frantar et al., 2023; van Baalen et al., 2024). Our finetuning setup uses training data from the RedPajama dataset (Computer, 2023), with a context size of 4096 tokens, a batch size of 128 sentences, and finetuning for 128 steps in 2-bit quantization and 32 steps in 3-bit quantization."
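Two numerical-stability details in the setup above (scaling gradients by 10^3 before squaring, and adding λ = 10^-7 to the Hessian diagonal) can be sketched concretely. This is a minimal illustration assuming a simple input-outer-product Hessian; the paper's actual accumulation over calibration batches may differ.

```python
import numpy as np

GRAD_SCALE = 1e3   # constant from the paper: scale gradients to avoid underflow
DAMPING = 1e-7     # lambda added to the Hessian diagonal for invertibility

def averaged_hessian(grads, inputs):
    """Sketch: build a gradient-weighted input Hessian with the paper's
    two stability tricks applied.

    grads:  (n, d_out) end-loss gradients
    inputs: (n, d_in) calibration activations
    """
    n, d_in = inputs.shape
    # Squaring tiny gradients underflows; scale them up first.
    g2 = np.mean((GRAD_SCALE * grads) ** 2)
    H = g2 * (inputs.T @ inputs) / n
    # Damp the diagonal so the (possibly singular) Hessian stays invertible.
    H += DAMPING * np.eye(d_in)
    return H
```

Any common scale factor cancels when the Hessian is used as a relative weighting, which is why a fixed constant like 10^3 is safe here.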