Beware of Calibration Data for Pruning Large Language Models

Authors: Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Experimental results on recent strong open-source LLMs (e.g., DCLM and LLaMA-3) show that the proposed strategy can enhance the performance of strong pruning methods (e.g., Wanda, DSnoT, OWL) by a large margin (up to 2.68%).
Researcher Affiliation | Collaboration | Yixin Ji (1,2), Yang Xiang (1,2), Juntao Li (1,2), Qingrong Xia (3), Ping Li (3), Xinyu Duan (3), Zhefeng Wang (3), Min Zhang (1,2). 1: School of Computer Science and Technology, Soochow University; 2: Key Laboratory of Data Intelligence and Advanced Computing, Soochow University; 3: Huawei Cloud, China.
Pseudocode | No | The paper describes methods such as the self-generation strategy for sampling calibration data, including a formula for generating tokens, but it does not present a structured pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/Dereck0602/calibration_data
Open Datasets | Yes | C4 (Raffel et al., 2020) is a widely used calibration data source... Wikipedia is a source of high-quality encyclopedic text... SlimPajama is a cleaned and deduplicated version of RedPajama... DCLM (Li et al., 2024) is the pre-training data of the DCLM-7B model... Instead, we choose the Alpaca (Taori et al., 2023) dataset...
Dataset Splits | Yes | Aside from the experiments in Section 3.3, we follow prior works and randomly sample 128 sequences with 2048 tokens as calibration data. To mitigate the impact of sampling randomness, all our experiments repeat the calibration data sampling 20 times with different random seeds and report the average performance. For MMLU, we use a 5-shot setting, while all other tasks are evaluated in a zero-shot setting.
Hardware Specification | No | The paper mentions LLMs such as DCLM-7B and LLaMA-3 and notes that 'advanced GPUs already support 2:4 sparse tensor cores', but it does not specify the hardware (GPU model, CPU, memory, etc.) used to run the experiments.
Software Dependencies | No | The paper notes that the evaluation code is based on the lm-evaluation-harness repository, but it does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | During the self-generation process, we use Top-k and Top-p sampling to improve the diversity of the generated data. Specifically, we set the p-value to 0.95, the k-value to 50, and the temperature to 0.6. We apply a repetition penalty of 1.2 to avoid repeatedly generating low-quality fragments. We randomly sample 5,000 examples each from C4, SlimPajama, Wikipedia, and DCLM for generation. In the filtering phase, we eliminate the top 20% of samples based on their perplexity.
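The settings quoted in the Experiment Setup row (Top-k = 50, Top-p = 0.95, temperature = 0.6, repetition penalty = 1.2, and dropping the top 20% of generated samples by perplexity) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the dictionary-based token distributions are hypothetical, and an actual run would use a generation API such as Hugging Face transformers' `model.generate` with these hyperparameters.

```python
import math

def apply_temperature_and_penalty(logits, generated, temperature=0.6, penalty=1.2):
    """Penalize tokens already generated (HF-style: divide positive logits,
    multiply negative ones by the penalty), then scale by temperature."""
    out = {}
    for tok, logit in logits.items():
        if tok in generated:
            logit = logit / penalty if logit > 0 else logit * penalty
        out[tok] = logit / temperature
    return out

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits.values())
    exps = {t: math.exp(l - m) for t, l in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def top_k_top_p_filter(probs, k=50, p=0.95):
    """Keep the k most probable tokens, then truncate to the smallest
    nucleus whose cumulative probability reaches p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    nucleus, cum = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in nucleus)
    return {tok: pr / total for tok, pr in nucleus}

def perplexity_filter(samples, drop_fraction=0.2):
    """Drop the highest-perplexity fraction of samples.
    `samples` is a list of (text, perplexity) pairs."""
    kept = sorted(samples, key=lambda s: s[1])
    cutoff = len(kept) - int(len(kept) * drop_fraction)
    return kept[:cutoff]
```

Sampling a token would then amount to drawing from `top_k_top_p_filter(softmax(apply_temperature_and_penalty(logits, generated)))`; the perplexity filter is applied once over the 5,000 generated examples per source corpus.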