SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

Authors: Runsheng Bai, Bo Liu, Qiang Liu

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our method across various models, including LLaMA (Touvron et al., 2023a), LLaMA2 (Touvron et al., 2023b), and OPT (Zhang et al., 2022), to assess its generalizability. Results for the LLaMA models are emphasized in the main text due to their widespread adoption, while comprehensive results for other models can be found in Appendix B.1. Regarding datasets, we primarily utilize WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) for evaluation, along with 100/128 samples from the C4 dataset for calibration. We use the perplexity of language generation experiments as a primary metric, reporting results on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets. Additionally, we evaluate accuracy on all PiQA (Bisk et al., 2020), ARC-Challenge/Easy (Clark et al., 2018) and MMLU (Hendrycks et al., 2020) benchmarks under both INT3/INT4 settings to assess the problem-solving capability of our quantized model.
Researcher Affiliation Academia 1 EECS Department, Massachusetts Institute of Technology; 2 Department of Computer Science, University of Texas at Austin. Correspondence to: Qiang Liu <EMAIL>, Runsheng Bai <EMAIL>.
Pseudocode Yes Algorithm 1: Algorithm for Bit Allocation; Algorithm 2: Iterative Optimization; Algorithm 3: Overall Algorithm for SKIM.
Open Source Code No The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository. It mentions using the 'official code' of SqueezeLLM, but not for its own methodology.
Open Datasets Yes Regarding datasets, we primarily utilize WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) for evaluation, along with 100/128 samples from the C4 dataset for calibration.
Dataset Splits Yes Regarding datasets, we primarily utilize WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) for evaluation, along with 100/128 samples from the C4 dataset for calibration. We use the perplexity of language generation experiments as a primary metric, reporting results on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets. Since our calibration dataset is derived from C4, perplexity on WikiText2 represents a zero-shot scenario, while perplexity on C4 corresponds to a few-shot scenario.
Hardware Specification Yes Overall, the entire process for quantizing LLaMA-7B takes around one hour with dual AMD EPYC processors and an RTX 3090 GPU.
Software Dependencies No The paper mentions 'Adam optimizer' but does not specify version numbers for any software libraries or dependencies used (e.g., PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes The default setting includes INT4 and INT3, as well as INT3 and INT2 with extra memory usage. Note that we have set the maximum available bit to 4 in order to maintain high memory efficiency. Consequently, mixed precision is disabled under the INT4 setting. To optimize the scaling vector, we utilize the Adam (Kingma, 2014) optimizer with a learning rate of 0.01, a decay rate of 0.5 every 40 steps, and a maximum of 120 iterations.
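The reported optimizer schedule (Adam, learning rate 0.01, halved every 40 steps, at most 120 iterations) can be sketched as follows. This is a minimal illustration only: the quadratic objective and the `optimize_scaling` helper below are stand-ins invented for the example, not the paper's actual quantization loss or code.

```python
# Sketch of the scaling-vector optimization schedule described in the paper:
# Adam, base lr = 0.01, lr halved every 40 steps, 120 iterations maximum.
# The objective here is a toy quadratic; the paper's real objective
# (quantization error of the scaled weights) is not reproduced.

def lr_at(step, base_lr=0.01, decay=0.5, every=40):
    """Step-decay schedule: the learning rate halves every 40 iterations."""
    return base_lr * (decay ** (step // every))

def optimize_scaling(grad_fn, x0, steps=120, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam loop over a scalar parameter, using the schedule above."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr_at(t - 1) * m_hat / (v_hat ** 0.5 + eps)
    return x

# Toy stand-in objective: minimize (x - 1)^2, whose gradient is 2(x - 1).
result = optimize_scaling(lambda x: 2.0 * (x - 1.0), x0=0.0)
```

With only 120 steps and this decaying learning rate, the toy parameter moves most of the way toward its optimum but does not fully converge, which matches the role of the schedule as a bounded-budget refinement step.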