SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Authors: Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. |
| Researcher Affiliation | Academia | 1The University of Hong Kong 2ETH Zürich 3Beihang University. Correspondence to: Haotong Qin, Shiming Zhang, Xiaojuan Qi <EMAIL, EMAIL, EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Main Framework of SliM-LLM. func SliM-LLM(w, xF, β, λ, N) ... Algorithm 2 Detailed functions in SliM-LLM. func SBA(w, xF, Hin, β, N) |
| Open Source Code | Yes | Our code is available at https://github.com/Aaronhuang-778/SliM-LLM. |
| Open Datasets | Yes | Experiments are carried out on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets. |
| Dataset Splits | Yes | We randomly select 128 samples from WikiText2 (Merity et al., 2016) as calibration data, each with 2048 tokens. |
| Hardware Specification | Yes | The quantization is carried out on a single NVIDIA A800 GPU. For SliM-LLM+, we employ the AdamW optimizer, following OmniQuant (Shao et al., 2023), which is also feasible on a single A800. |
| Software Dependencies | No | The paper mentions "open-source AutoGPTQ" for extending the CUDA kernel, but does not specify a version for AutoGPTQ or CUDA, or any other software component. |
| Experiment Setup | Yes | Per-channel group quantization is utilized in our framework with groupsize = 128 in experiments. Since there is no backpropagation in SliM-LLM, the quantization is carried out on a single NVIDIA A800 GPU. For SliM-LLM+, we employ the AdamW optimizer, following OmniQuant (Shao et al., 2023)... We randomly select 128 samples from WikiText2 (Merity et al., 2016) as calibration data, each with 2048 tokens... We empirically set λ at 0.1 and n at 50 to achieve a balance between efficiency and accuracy. |
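The experiment setup above relies on per-channel group quantization with groupsize = 128. For readers unfamiliar with the technique, the sketch below illustrates plain symmetric group-wise quantization: each group of 128 consecutive weights shares one scale derived from the group's maximum magnitude. This is a minimal illustration only, not the authors' SliM-LLM implementation (which additionally assigns salience-driven mixed bit-widths per group); all names are illustrative.

```python
import numpy as np

def quantize_groupwise(w, bits=2, group_size=128):
    """Symmetric group-wise quantization sketch: every `group_size`
    consecutive weights share a single scale (not the SliM-LLM method,
    which also varies `bits` per group by salience)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit symmetric
    groups = w.reshape(-1, group_size)   # one row per quantization group
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0              # guard all-zero groups
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)  # dequantized approximation

# Toy usage: quantize a flat weight vector at 2 bits, group size 128.
w = np.random.randn(4096).astype(np.float32)
w_q = quantize_groupwise(w, bits=2, group_size=128)
print(np.abs(w - w_q).mean())  # mean reconstruction error
```

With bits=2 each group collapses to at most four distinct levels, which is why grouping (a fresh scale every 128 weights) matters so much for accuracy at these bit-widths.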