Beware of Calibration Data for Pruning Large Language Models
Authors: Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on recent strong open-source LLMs (e.g., DCLM and LLaMA-3) show that the proposed strategy can enhance the performance of strong pruning methods (e.g., Wanda, DSnoT, OWL) by a large margin (up to 2.68%). |
| Researcher Affiliation | Collaboration | Yixin Ji (1,2), Yang Xiang (1,2), Juntao Li (1,2), Qingrong Xia (3), Ping Li (3), Xinyu Duan (3), Zhefeng Wang (3), Min Zhang (1,2). 1: School of Computer Science and Technology, Soochow University; 2: Key Laboratory of Data Intelligence and Advanced Computing, Soochow University; 3: Huawei Cloud, China |
| Pseudocode | No | The paper describes methods and processes like the self-generation strategy for sampling calibration data using a formula for generating tokens, but it does not present a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/Dereck0602/calibration_data |
| Open Datasets | Yes | C4 (Raffel et al., 2020) is a widely used calibration data source... Wikipedia is a source of high-quality encyclopedic text... Slimpajama is a cleaned and deduplicated version of RedPajama... DCLM (Li et al., 2024) is the pre-training data of the DCLM-7B model... Instead, we choose the Alpaca (Taori et al., 2023) dataset... |
| Dataset Splits | Yes | Aside from the experiments in Section 3.3, we follow prior works and randomly sample 128 sequences with 2048 tokens as calibration data. To mitigate the impact of sampling randomness, all our experiments repeat the calibration data sampling 20 times with different random seeds and report the average performance. For MMLU, we use a 5-shot setting, while all other tasks are evaluated in a zero-shot setting. |
| Hardware Specification | No | The paper mentions LLMs like DCLM-7B and LLaMA-3, and discusses 'advanced GPUs already support 2:4 sparse tensor cores', but it does not specify any particular hardware (GPU model, CPU, memory, etc.) used to perform the experiments described within the paper. |
| Software Dependencies | No | The paper mentions that the evaluation code is based on the 'lm-evaluation-harness repository' but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | During the self-generation process, we use Top-k and Top-p sampling to improve the diversity of the generated data. Specifically, we set the p-value to 0.95, the k-value to 50, and the temperature to 0.6. We apply the repetition penalty of 1.2 to avoid repeatedly generating low-quality fragments. We randomly sample 5,000 examples from C4, Slimpajama, Wikipedia, and DCLM respectively for generation. In the filtering phase, we eliminate the top 20% of samples based on their perplexity. |
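The calibration setup quoted above (128 randomly sampled sequences of 2048 tokens each) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name `sample_calibration` and the flat token-ID list input are assumptions for the sketch.

```python
import random


def sample_calibration(token_ids, n_seq=128, seq_len=2048, seed=0):
    """Randomly sample n_seq contiguous windows of seq_len tokens
    from a tokenized corpus, as in standard pruning calibration setups.

    token_ids: flat list of token IDs for the corpus
    Returns a list of n_seq sequences, each seq_len tokens long.
    """
    rng = random.Random(seed)  # seeded so each of the 20 repeats is reproducible
    max_start = len(token_ids) - seq_len
    starts = [rng.randrange(max_start + 1) for _ in range(n_seq)]
    return [token_ids[s:s + seq_len] for s in starts]


# Toy corpus standing in for tokenized C4/Wikipedia/Slimpajama/DCLM text.
corpus = list(range(1_000_000))
batch = sample_calibration(corpus, n_seq=128, seq_len=2048, seed=42)
```

Repeating this with 20 different seeds and averaging, as the paper describes, controls for the variance that sampling randomness introduces into pruning results.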
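The filtering phase described above (eliminating the top 20% of self-generated samples by perplexity) can be sketched as below. This is an illustrative reconstruction, assuming per-sample mean negative log-likelihoods are already available from the generating model; the function name and inputs are not from the paper's code.

```python
import math


def filter_by_perplexity(samples, nlls, drop_frac=0.20):
    """Drop the highest-perplexity fraction of self-generated samples.

    samples: generated text samples
    nlls: per-token mean negative log-likelihood of each sample
    Perplexity = exp(mean NLL); the highest-perplexity samples are
    treated as low-quality generations and removed.
    """
    ppls = [math.exp(nll) for nll in nlls]
    # Rank by perplexity (ascending) and keep the lowest (1 - drop_frac) share.
    ranked = sorted(zip(samples, ppls), key=lambda pair: pair[1])
    n_keep = int(len(ranked) * (1.0 - drop_frac))
    return [sample for sample, _ in ranked[:n_keep]]


texts = ["s1", "s2", "s3", "s4", "s5"]
nlls = [2.1, 3.5, 1.8, 4.2, 2.6]
kept = filter_by_perplexity(texts, nlls)  # drops "s4", the highest-perplexity sample
```

The generation step itself would use nucleus/top-k sampling with the quoted hyperparameters (p=0.95, k=50, temperature 0.6, repetition penalty 1.2), which in practice map directly onto a decoding library's sampling parameters.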