Beware of Calibration Data for Pruning Large Language Models

Authors: Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Experimental results on recent strong open-source LLMs (e.g., DCLM and LLaMA-3) show that the proposed strategy can enhance the performance of strong pruning methods (e.g., Wanda, DSnoT, OWL) by a large margin (up to 2.68%).
Researcher Affiliation | Collaboration | Yixin Ji (1,2), Yang Xiang (1,2), Juntao Li (1,2), Qingrong Xia (3), Ping Li (3), Xinyu Duan (3), Zhefeng Wang (3), Min Zhang (1,2). 1: School of Computer Science and Technology, Soochow University; 2: Key Laboratory of Data Intelligence and Advanced Computing, Soochow University; 3: Huawei Cloud, China.
Pseudocode | No | The paper describes methods such as the self-generation strategy for sampling calibration data, including a formula for generating tokens, but it does not present a structured pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/Dereck0602/calibration_data
Open Datasets | Yes | C4 (Raffel et al., 2020) is a widely used calibration data source... Wikipedia is a source of high-quality encyclopedic text... SlimPajama is a cleaned and deduplicated version of RedPajama... DCLM (Li et al., 2024) is the pre-training data of the DCLM-7B model... Instead, we choose the Alpaca (Taori et al., 2023) dataset...
Dataset Splits | Yes | Aside from the experiments in Section 3.3, we follow prior works and randomly sample 128 sequences with 2048 tokens as calibration data. To mitigate the impact of sampling randomness, all our experiments repeat the calibration data sampling 20 times with different random seeds and report the average performance. For MMLU, we use a 5-shot setting, while all other tasks are evaluated in a zero-shot setting.
Hardware Specification | No | The paper mentions LLMs such as DCLM-7B and LLaMA-3 and notes that 'advanced GPUs already support 2:4 sparse tensor cores', but it does not specify the hardware (GPU model, CPU, memory, etc.) used to run the experiments.
Software Dependencies | No | The paper notes that the evaluation code is based on the lm-evaluation-harness repository, but it does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | During the self-generation process, we use Top-k and Top-p sampling to improve the diversity of the generated data. Specifically, we set the p-value to 0.95, the k-value to 50, and the temperature to 0.6. We apply a repetition penalty of 1.2 to avoid repeatedly generating low-quality fragments. We randomly sample 5,000 examples each from C4, SlimPajama, Wikipedia, and DCLM for generation. In the filtering phase, we eliminate the top 20% of samples based on their perplexity.
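The settings quoted in the Experiment Setup row (Top-k = 50, Top-p = 0.95, temperature = 0.6, repetition penalty = 1.2, and dropping the top 20% of generated samples by perplexity) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the dictionary-based token distributions are hypothetical, and an actual run would use a generation API such as Hugging Face transformers' `model.generate` with these hyperparameters.

```python
import math

def apply_temperature_and_penalty(logits, generated, temperature=0.6, penalty=1.2):
    """Penalize tokens already generated (HF-style: divide positive logits,
    multiply negative ones by the penalty), then scale by temperature."""
    out = {}
    for tok, logit in logits.items():
        if tok in generated:
            logit = logit / penalty if logit > 0 else logit * penalty
        out[tok] = logit / temperature
    return out

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits.values())
    exps = {t: math.exp(l - m) for t, l in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def top_k_top_p_filter(probs, k=50, p=0.95):
    """Keep the k most probable tokens, then truncate to the smallest
    nucleus whose cumulative probability reaches p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    nucleus, cum = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in nucleus)
    return {tok: pr / total for tok, pr in nucleus}

def perplexity_filter(samples, drop_fraction=0.2):
    """Drop the highest-perplexity fraction of samples.
    `samples` is a list of (text, perplexity) pairs."""
    kept = sorted(samples, key=lambda s: s[1])
    cutoff = len(kept) - int(len(kept) * drop_fraction)
    return kept[:cutoff]
```

Sampling a token would then amount to drawing from `top_k_top_p_filter(softmax(apply_temperature_and_penalty(logits, generated)))`; the perplexity filter is applied once over the 5,000 generated examples per source corpus.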