SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

Authors: Runsheng Bai, Bo Liu, Qiang Liu

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our method across various models, including LLaMA (Touvron et al., 2023a), LLaMA2 (Touvron et al., 2023b), and OPT (Zhang et al., 2022), to assess its generalizability. Results for the LLaMA models are emphasized in the main text due to their widespread adoption, while comprehensive results for other models can be found in Appendix B.1. Regarding datasets, we primarily utilize WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) for evaluation, along with 100/128 samples from the C4 dataset for calibration. We use the perplexity of language generation experiments as a primary metric, reporting results on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets. Additionally, we evaluate accuracy on all PiQA (Bisk et al., 2020), ARC-Challenge/Easy (Clark et al., 2018) and MMLU (Hendrycks et al., 2020) benchmarks under both INT3/INT4 settings to assess the problem-solving capability of our quantized model.
Researcher Affiliation Academia 1 EECS Department, Massachusetts Institute of Technology; 2 Department of Computer Science, University of Texas at Austin. Correspondence to: Qiang Liu <EMAIL>, Runsheng Bai <EMAIL>.
Pseudocode Yes Algorithm 1: Algorithm for Bit Allocation; Algorithm 2: Iterative Optimization; Algorithm 3: Overall Algorithm for SKIM.
Open Source Code No The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository. It mentions using the 'official code' of SqueezeLLM, but not for its own methodology.
Open Datasets Yes Regarding datasets, we primarily utilize WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) for evaluation, along with 100/128 samples from the C4 dataset for calibration.
Dataset Splits Yes Regarding datasets, we primarily utilize WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) for evaluation, along with 100/128 samples from the C4 dataset for calibration. We use the perplexity of language generation experiments as a primary metric, reporting results on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets. Since our calibration dataset is derived from C4, perplexity on WikiText2 represents a zero-shot scenario, while perplexity on C4 corresponds to a few-shot scenario.
Hardware Specification Yes Overall, the entire process for quantizing LLaMA-7B takes around one hour with dual AMD EPYC processors and an RTX 3090 GPU.
Software Dependencies No The paper mentions 'Adam optimizer' but does not specify version numbers for any software libraries or dependencies used (e.g., PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes The default setting includes INT4 and INT3, as well as INT3 and INT2 with extra memory usage. Note that we have set the maximum available bit to 4 in order to maintain high memory efficiency. Consequently, mixed precision is disabled under the INT4 setting. To optimize the scaling vector, we utilize the Adam (Kingma, 2014) optimizer with a learning rate of 0.01, a decay rate of 0.5 every 40 steps, and a maximum of 120 iterations.
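The reported optimizer schedule (Adam, learning rate 0.01, halved every 40 steps, at most 120 iterations) can be sketched as follows. This is a minimal illustration only: the quadratic objective and the `optimize_scaling` helper below are stand-ins invented for the example, not the paper's actual quantization loss or code.

```python
# Sketch of the scaling-vector optimization schedule described in the paper:
# Adam, base lr = 0.01, lr halved every 40 steps, 120 iterations maximum.
# The objective here is a toy quadratic; the paper's real objective
# (quantization error of the scaled weights) is not reproduced.

def lr_at(step, base_lr=0.01, decay=0.5, every=40):
    """Step-decay schedule: the learning rate halves every 40 iterations."""
    return base_lr * (decay ** (step // every))

def optimize_scaling(grad_fn, x0, steps=120, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam loop over a scalar parameter, using the schedule above."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr_at(t - 1) * m_hat / (v_hat ** 0.5 + eps)
    return x

# Toy stand-in objective: minimize (x - 1)^2, whose gradient is 2(x - 1).
result = optimize_scaling(lambda x: 2.0 * (x - 1.0), x0=0.0)
```

With only 120 steps and this decaying learning rate, the toy parameter moves most of the way toward its optimum but does not fully converge, which matches the role of the schedule as a bounded-budget refinement step.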