GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Authors: Pengxiang Zhao, Xiaoming Yuan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57× speedup over the baseline, advancing memory and inference efficiency in LLM deployment. |
| Researcher Affiliation | Academia | Department of Mathematics, The University of Hong Kong, Hong Kong SAR, China. Correspondence to: Xiaoming Yuan <EMAIL>. |
| Pseudocode | Yes | Finally, the full pseudocode of GANQ for layer-wise LUT-based non-uniform quantization is presented in Algorithm 1. Algorithm 1 GANQ: GPU-Adaptive Layer-Wise LUT-Based Non-Uniform Quantization |
| Open Source Code | Yes | Our implementation is publicly available. The code is available at https://github.com/Evans-Z/GANQ |
| Open Datasets | Yes | Following prior work (Frantar et al., 2022; Shao et al., 2024; Ma et al., 2024; Kim et al., 2024), we evaluate the quantized models by reporting perplexity on language datasets, specifically using the WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2020), and PTB (Marcus et al., 1994) datasets. Additionally, we assess accuracy on zero-shot tasks, including ARC Easy, ARC Challenge (Clark et al., 2018), WinoGrande (Sakaguchi et al., 2021), BoolQ (Clark et al., 2019), RTE (Wang et al., 2018), HellaSwag (Zellers et al., 2019), and GSM8K (Cobbe et al., 2021), facilitated by the LM Harness library (Gao et al., 2021). |
| Dataset Splits | Yes | Following prior work (Frantar et al., 2022; Shao et al., 2024; Ma et al., 2024; Kim et al., 2024), we evaluate the quantized models by reporting perplexity on language datasets, specifically using the WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2020), and PTB (Marcus et al., 1994) datasets. Consistent with established practice, we use a sequence length of 2,048 across all models. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | We implement GANQ using PyTorch (Paszke et al., 2019) and utilize the Hugging Face Transformers library (Wolf, 2019) for model and dataset management. |
| Experiment Setup | Yes | The default configuration employs INT4/3 per-channel weight quantization. ... For calibration data, we follow the methodology outlined in previous works (Frantar et al., 2022; Shao et al., 2024; Kim et al., 2024). Specifically, we use 32 sequences for OPT models and 128 sequences for LLaMA models. Each sequence consists of 2,048 tokens, sampled from the first shard of the C4 dataset. ... Using the Torch CUDA profiler, we measure single-sequence (batch size 1) generation of 1,024 tokens on a single NVIDIA RTX 4090 GPU, reporting CUDA time and peak memory usage with the LUT-based inference kernels from (Kim et al., 2024). |
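The calibration setup quoted above (32 sequences for OPT, 128 for LLaMA, each of 2,048 tokens drawn from the first C4 shard) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `sample_calibration`, the random-window sampling strategy, and the synthetic token stream are all assumptions standing in for an actual tokenized C4 shard.

```python
import random

def sample_calibration(token_ids, n_sequences, seq_len=2048, seed=0):
    """Draw n_sequences random windows of seq_len tokens from a token stream.

    Mirrors the calibration description in the paper (32 sequences for OPT,
    128 for LLaMA, 2,048 tokens each, from the first C4 shard); the exact
    sampling procedure here is an assumption for illustration.
    """
    rng = random.Random(seed)
    max_start = len(token_ids) - seq_len
    starts = (rng.randrange(max_start + 1) for _ in range(n_sequences))
    return [token_ids[s:s + seq_len] for s in starts]

# Synthetic stand-in for a tokenized C4 shard.
corpus = list(range(100_000))
calib = sample_calibration(corpus, n_sequences=32)  # OPT setting
print(len(calib), len(calib[0]))  # 32 2048
```

For LLaMA models the same call would use `n_sequences=128`, per the quoted setup.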