GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

Authors: Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we demonstrate the versatility and effectiveness of our method across various quantization schemes. We first explore different quantization scenarios and identify the formats best suited to each setting, ultimately focusing on three main approaches: weight-only scalar, weight-only vector, and weight-and-activation quantization. By integrating the GuidedQuant objective into existing methods, our results consistently achieve state-of-the-art PTQ performance. Refer to Appendix D.2 for details on how we incorporate the GuidedQuant objective into existing methods. Additional experiments and details, including the overall cost of our method, the effect of the number of groups g, and the end-to-end finetuning results, are provided in Appendix E."
Researcher Affiliation | Collaboration | "1 Department of Computer Science and Engineering, Seoul National University; 2 Neural Processing Research Center; 3 Samsung AI Lab, Montreal; 4 Google."
Pseudocode | Yes | "Algorithm 1 GuidedQuant. Input: layer-wise quantization algorithm Q, number of groups g, number of linear layers L. Algorithm 2 LNQ. Input: Hessian of the objective H ∈ R^{d_in × d_in}, input weight W ∈ R^{d_in × d_out}, initial assignment P(j) ∈ R^{d_in × m} for each output channel j."
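The quoted pseudocode only lists the algorithm inputs, so the overall control flow may be easier to see in code. The sketch below is a hypothetical reading of Algorithm 1: split each layer's output channels into g groups, weight one per-group Hessian by that group's averaged squared end-loss gradients, and hand the result to an off-the-shelf layer-wise quantizer Q. The function name `guidedquant_sketch` and the exact guidance weighting are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def guidedquant_sketch(weights, inputs, grads, Q, g):
    """Hypothetical sketch of Algorithm 1 (GuidedQuant).

    weights: list of (d_in, d_out) weight matrices, one per linear layer
    inputs:  list of (n, d_in) calibration activations
    grads:   list of (n, d_out) end-loss output gradients
    Q:       layer-wise quantizer taking (W, per-group Hessians, groups)
    g:       number of output-channel groups
    """
    quantized = []
    for W, X, G in zip(weights, inputs, grads):
        d_out = W.shape[1]
        groups = np.array_split(np.arange(d_out), g)
        H_groups = []
        for idx in groups:
            # Scale the input Hessian by the group's averaged squared
            # end-loss gradients (the "end loss guidance" weighting).
            s = np.mean(G[:, idx] ** 2)
            H_groups.append(s * (X.T @ X))
        quantized.append(Q(W, H_groups, groups))
    return quantized
```

A trivial rounding quantizer can stand in for Q (e.g. LNQ or QTIP in the paper) to exercise the loop.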
Open Source Code | Yes | "We release the code at https://github.com/snu-mllab/GuidedQuant."
Open Datasets | Yes | "We use the RedPajama dataset (Computer, 2023) for calibration, following prior work (Egiazarian et al., 2024; Tseng et al., 2024a;b), with 1024 sentences, each containing 4096 tokens. We report perplexity on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) validation sets."
Dataset Splits | Yes | "We use the RedPajama dataset (Computer, 2023) for calibration, following prior work (Egiazarian et al., 2024; Tseng et al., 2024a;b), with 1024 sentences, each containing 4096 tokens. For weight-and-activation quantization methods... we use the WikiText2 dataset (Merity et al., 2016) for calibration, with 128 sentences, each containing 2048 tokens (Ashkboos et al., 2024; Liu et al., 2024). Our finetuning setup uses training data from the RedPajama dataset (Computer, 2023), with a context size of 4096 tokens, a batch size of 128 sentences, and finetuning for 128 steps in 2-bit quantization and 32 steps in 3-bit quantization."
Hardware Specification | Yes | "Table 2. End-to-end inference throughput of Llama-2 models on RTX 4090 GPU. Table 8. Total GPU cost incurred during the quantization process for LNQ and QTIP, both with and without GuidedQuant, across various group sizes g. We specify the number and type of GPU used in the parentheses. R6A denotes the RTX 6000 Ada GPU. Table 9. Total GPU cost and disk usage incurred during the gradient and Hessian caching processes... R6A and A100 denote the RTX 6000 Ada GPU and the A100 GPU, respectively. To demonstrate the speedup achieved by our optimization techniques for the CD algorithm, we report the quantization time for quantizing the Llama-2-7B model into 4-bit precision on a single RTX 6000 Ada GPU. Quantizing Llama-2-70B using our LNQ algorithm takes less than three hours when using 8 RTX 6000 Ada GPUs. Throughput is measured on an RTX 3090 GPU as the average of 5 runs, with standard deviation in parentheses."
Software Dependencies | Yes | "after integrating the kernels into a PyTorch-based inference pipeline optimized with the torch.compile function (Ansel et al., 2024; Gray, 2019). We evaluate on these tasks using version 0.4.3 of the lm-evaluation-harness library (Gao et al., 2024)."
Experiment Setup | Yes | "For weight-only quantization experiments, we set g = 4 for Llama-2-7B and Llama-2-13B, and g = 2 for Llama-2-70B. For weight-and-activation quantization experiments, we set g = 1. In our implementation, we scale the gradients by a large constant (we used 10^3 in all experiments) while computing the averaged Hessians H_k to prevent underflow. For Llama-2-7B and Llama-2-13B, we use T = 2 and K = 4, and for Llama-2-70B, we use T = 1 and K = 4 in all the experiments. To address this, we add a small constant λ = 10^-7 to the diagonal of the matrix, as commonly done in prior work (Frantar & Alistarh, 2022; Frantar et al., 2023; van Baalen et al., 2024). Our finetuning setup uses training data from the RedPajama dataset (Computer, 2023), with a context size of 4096 tokens, a batch size of 128 sentences, and finetuning for 128 steps in 2-bit quantization and 32 steps in 3-bit quantization."
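Two numerical-stability details in the setup above (scaling gradients by 10^3 before squaring, and adding λ = 10^-7 to the Hessian diagonal) can be sketched concretely. This is a minimal illustration assuming a simple input-outer-product Hessian; the paper's actual accumulation over calibration batches may differ.

```python
import numpy as np

GRAD_SCALE = 1e3   # constant from the paper: scale gradients to avoid underflow
DAMPING = 1e-7     # lambda added to the Hessian diagonal for invertibility

def averaged_hessian(grads, inputs):
    """Sketch: build a gradient-weighted input Hessian with the paper's
    two stability tricks applied.

    grads:  (n, d_out) end-loss gradients
    inputs: (n, d_in) calibration activations
    """
    n, d_in = inputs.shape
    # Squaring tiny gradients underflows; scale them up first.
    g2 = np.mean((GRAD_SCALE * grads) ** 2)
    H = g2 * (inputs.T @ inputs) / n
    # Damp the diagonal so the (possibly singular) Hessian stays invertible.
    H += DAMPING * np.eye(d_in)
    return H
```

Any common scale factor cancels when the Hessian is used as a relative weighting, which is why a fixed constant like 10^3 is safe here.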