GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Authors: Pengxiang Zhao, Xiaoming Yuan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57× speedup over the baseline, advancing memory and inference efficiency in LLM deployment. |
| Researcher Affiliation | Academia | Department of Mathematics, The University of Hong Kong, Hong Kong SAR, China. Correspondence to: Xiaoming Yuan <EMAIL>. |
| Pseudocode | Yes | Finally, the full pseudocode of GANQ for layer-wise LUT-based non-uniform quantization is presented in Algorithm 1. Algorithm 1 GANQ: GPU-Adaptive Layer-Wise LUT-Based Non-Uniform Quantization |
| Open Source Code | Yes | Our implementation is publicly available. The code is available at https://github.com/Evans-Z/GANQ |
| Open Datasets | Yes | Following prior work (Frantar et al., 2022; Shao et al., 2024; Ma et al., 2024; Kim et al., 2024), we evaluate the quantized models by reporting perplexity on language datasets, specifically using the WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2020), and PTB (Marcus et al., 1994) datasets. Additionally, we assess accuracy on zero-shot tasks, including ARC Easy, ARC Challenge (Clark et al., 2018), WinoGrande (Sakaguchi et al., 2021), BoolQ (Clark et al., 2019), RTE (Wang et al., 2018), HellaSwag (Zellers et al., 2019), and GSM8K (Cobbe et al., 2021), facilitated by the LM Harness library (Gao et al., 2021). |
| Dataset Splits | Yes | Following prior work (Frantar et al., 2022; Shao et al., 2024; Ma et al., 2024; Kim et al., 2024), we evaluate the quantized models by reporting perplexity on language datasets, specifically using the WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2020), and PTB (Marcus et al., 1994) datasets. Consistent with established practice, we use a sequence length of 2,048 across all models. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | We implement GANQ using PyTorch (Paszke et al., 2019) and utilize the Hugging Face Transformers library (Wolf, 2019) for model and dataset management. |
| Experiment Setup | Yes | The default configuration employs INT4/3 per-channel weight quantization. ... For calibration data, we follow the methodology outlined in previous works (Frantar et al., 2022; Shao et al., 2024; Kim et al., 2024). Specifically, we use 32 sequences for OPT models and 128 sequences for LLaMA models. Each sequence consists of 2,048 tokens, sampled from the first shard of the C4 dataset. ... Using the Torch CUDA profiler, we measure single-sequence (batch size 1) generation of 1,024 tokens on a single NVIDIA RTX 4090 GPU, reporting CUDA time and peak memory usage with the LUT-based inference kernels from (Kim et al., 2024). |
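The calibration setup quoted above (32 sequences for OPT, 128 for LLaMA, each of 2,048 tokens drawn from the first C4 shard) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `sample_calibration`, the random-window sampling strategy, and the synthetic token stream are all assumptions standing in for an actual tokenized C4 shard.

```python
import random

def sample_calibration(token_ids, n_sequences, seq_len=2048, seed=0):
    """Draw n_sequences random windows of seq_len tokens from a token stream.

    Mirrors the calibration description in the paper (32 sequences for OPT,
    128 for LLaMA, 2,048 tokens each, from the first C4 shard); the exact
    sampling procedure here is an assumption for illustration.
    """
    rng = random.Random(seed)
    max_start = len(token_ids) - seq_len
    starts = (rng.randrange(max_start + 1) for _ in range(n_sequences))
    return [token_ids[s:s + seq_len] for s in starts]

# Synthetic stand-in for a tokenized C4 shard.
corpus = list(range(100_000))
calib = sample_calibration(corpus, n_sequences=32)  # OPT setting
print(len(calib), len(calib[0]))  # 32 2048
```

For LLaMA models the same call would use `n_sequences=128`, per the quoted setup.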