SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Authors: Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed.
Researcher Affiliation Academia ¹The University of Hong Kong, ²ETH Zürich, ³Beihang University. Correspondence to: Haotong Qin, Shiming Zhang, Xiaojuan Qi <EMAIL, EMAIL, EMAIL>.
Pseudocode Yes Algorithm 1 Main Framework of SliM-LLM. func SliM-LLM(w, xF, β, λ, N) ... Algorithm 2 Detailed functions in SliM-LLM. func SBA(w, xF, Hin, β, N)
Open Source Code Yes Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
Open Datasets Yes Experiments are carried out on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets.
Dataset Splits Yes We randomly select 128 samples from WikiText2 (Merity et al., 2016) as calibration data, each with 2048 tokens.
Hardware Specification Yes the quantization is carried out on a single NVIDIA A800 GPU. For SliM-LLM+, we employ the AdamW optimizer, following OmniQuant (Shao et al., 2023), which is also feasible on a single A800.
Software Dependencies No The paper mentions the "open-source AutoGPTQ" for extending the CUDA kernel, but does not specify a version for AutoGPTQ, CUDA, or any other software component.
Experiment Setup Yes Per-channel group quantization is utilized in our framework with groupsize = 128 in experiments. Since no backpropagation in SliM-LLM, the quantization is carried out on a single NVIDIA A800 GPU. For SliM-LLM+, we employ the AdamW optimizer, following OmniQuant (Shao et al., 2023)... We randomly select 128 samples from WikiText2 (Merity et al., 2016) as calibration data, each with 2048 tokens... We empirically set λ at 0.1 and n at 50 to achieve a balance between efficiency and accuracy.
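The Research Type row cites a roughly 6x memory reduction for a 2-bit LLaMA-7B against the floating-point baseline. A quick sanity check is possible under stated assumptions (FP16 baseline, one FP16 scale and one FP16 zero point per group of 128 weights; these storage-format details are assumptions, not taken from the paper):

```python
# Back-of-envelope check of the ~6x memory claim. Assumptions: FP16
# baseline, 2-bit weight payload, FP16 scale + FP16 zero point per
# group of 128 weights. Unquantized embeddings and runtime buffers,
# which the real figure includes, are ignored here.
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GiB for a given effective bit-width."""
    return n_params * bits_per_weight / 8 / 2**30

n = 7e9                                    # LLaMA-7B parameter count
fp16 = weight_gib(n, 16)
quant = weight_gib(n, 2 + 32 / 128)        # payload + per-group metadata
print(f"{fp16 / quant:.1f}x smaller")      # ~7.1x for weights alone
```

Weights alone compress about 7x under these assumptions; the paper's "nearly 6x" end-to-end figure is consistent once unquantized components are counted.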
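The Pseudocode row names an SBA function (salience-determined bit allocation). As a hedged illustration of the general idea only, the sketch below assigns higher precision to the most salient weight groups while preserving the average bit-width; the function name, the quartile split, and the salience aggregation are all hypothetical simplifications, not the paper's actual SBA objective:

```python
import numpy as np

def sba_sketch(salience: np.ndarray, target_bits: int = 2,
               group_size: int = 128) -> np.ndarray:
    """Hypothetical sketch of salience-driven mixed precision: groups
    with the highest aggregate salience are promoted to target_bits+1,
    the least salient demoted to target_bits-1, in equal numbers so
    the average bit-width stays at target_bits."""
    group_sal = salience.reshape(-1, group_size).sum(axis=1)
    order = np.argsort(-group_sal)            # most salient groups first
    bits = np.full(len(group_sal), target_bits)
    k = len(group_sal) // 4                   # promote/demote equal counts
    bits[order[:k]] += 1                      # salient -> higher precision
    bits[order[-k:]] -= 1                     # least salient -> lower precision
    return bits

sal = np.random.rand(8 * 128)                 # toy per-weight salience scores
bits = sba_sketch(sal)
print(bits.mean())                            # 2.0: average bit-width preserved
```

Promoting and demoting equal numbers of groups is what keeps the memory budget identical to uniform 2-bit quantization while concentrating precision where it matters.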
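The Experiment Setup row specifies per-channel group quantization with groupsize = 128. A minimal sketch of plain asymmetric min-max group quantization at that group size follows; note this is the standard baseline scheme, not SliM-LLM's salience-weighted variant:

```python
import numpy as np

def group_quantize(w: np.ndarray, bits: int = 2,
                   group_size: int = 128) -> np.ndarray:
    """Quantize-dequantize with one min-max (scale, offset) pair per
    contiguous group of `group_size` weights. Standard asymmetric
    rounding; SliM-LLM's salience-aware rounding is not reproduced."""
    qmax = 2**bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)  # avoid div-by-zero
    q = np.clip(np.round((g - lo) / scale), 0, qmax)  # integer codes
    return (q * scale + lo).reshape(w.shape)          # dequantized weights

w = np.random.randn(4, 256).astype(np.float32)
wq = group_quantize(w)                    # error bounded by scale/2 per group
```

Smaller groups track local weight statistics more tightly at the cost of more scale/offset metadata; groupsize = 128 is the common trade-off the paper adopts.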