Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision

Authors: Li Shen, Anke Tang, Yong Luo, Tao Sun, Han Hu, Xiaochun Cao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on LLaMA models validate our method's effectiveness across various pruning techniques and sparsity levels. At 50% sparsity, it reduces perplexity by 53.9% compared to conventional magnitude pruning on LLaMA-7B. Section 5 is titled "Experiment" and details evaluations on datasets like WikiText-2, TruthfulQA, GSM8K, ARC-C, and MMLU, presenting perplexity results in tables and figures.
Researcher Affiliation | Academia | 1School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China 2National Engineering Research Center for Multimedia Software, School of Computer Science and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, Hubei, China 3National University of Defense Technology, Hunan, China 4School of Information and Electronics, Beijing Institute of Technology, Beijing, China 5Key Laboratory of Cyberspace Security, Ministry of Education, China. Correspondence to: Anke Tang <EMAIL>, Yong Luo <EMAIL>. All authors are affiliated with universities or national research institutions, and the provided email addresses are academic domains.
Pseudocode | Yes | Algorithm 1: The Proposed Iterative Weight Update Method
1: Inputs: dense weight matrix W, binary mask P, target rank k, number of iterations T
2: Initialize S^(0) ← W ⊙ P
3: for t = 0 to T−1 do
4:   L^(t) ← W − S^(t)
5:   Compute SVD: L^(t) = U^(t) Σ^(t) V^(t)⊤
6:   r^(t) ← 1 + ⌊((k−1)/(T−1)) · t⌋
7:   S^(t+1) ← S^(t) + P ⊙ (U^(t)_{:,r^(t):} Σ^(t)_{r^(t):} (V^(t)_{:,r^(t):})⊤)
8: end for
9: L^(T) ← W − S^(T)
10: Returns: S^(T), L^(T)
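The iterative update in Algorithm 1 can be sketched in plain NumPy. This is a minimal sketch, not the authors' implementation: the function and variable names are mine, it assumes T ≥ 2 so the rank schedule from 1 to k is well defined, and the step on line 7 is read as folding the singular components beyond rank r^(t) back into the sparse part on the mask.

```python
import numpy as np

def iterative_refinement(W, P, k, T):
    """Split a dense matrix W into a sparse part S (supported on binary
    mask P) and a residual low-rank part L, growing the rank of the
    retained low-rank approximation from 1 up to k over T iterations.
    Sketch of Algorithm 1; assumes T >= 2."""
    S = W * P                                   # S^(0) = W ⊙ P
    for t in range(T):                          # t = 0 .. T-1
        L = W - S                               # current low-rank candidate
        U, sigma, Vt = np.linalg.svd(L, full_matrices=False)
        r = 1 + ((k - 1) * t) // (T - 1)        # rank schedule: 1 -> k
        # components of L beyond rank r, folded back into S on the mask
        tail = (U[:, r:] * sigma[r:]) @ Vt[r:, :]
        S = S + P * tail
    L = W - S
    return S, L
```

By construction S + L reconstructs W exactly, and S stays supported on the mask P, matching the data-free character of the method (no calibration data is touched).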
Open Source Code No The paper does not contain any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We conducted our experiments using LLaMA models, evaluating their performance on WikiText-2 (Merity et al., 2016) and standard benchmarks including TruthfulQA (Lin et al., 2021), GSM8K (Cobbe et al., 2021), ARC-C (Clark et al., 2018) and MMLU (Hendrycks et al., 2020). ... When implementing Wanda pruning (Sun et al., 2023) and our method combined with Wanda (Wanda + Ours), we use 128 sequences from the allenai/c4 dataset as calibration data.
Dataset Splits | Yes | For evaluation, we use 128 sequences from the WikiText-2 dataset for perplexity evaluation. ... When implementing Wanda pruning (Sun et al., 2023) and our method combined with Wanda (Wanda + Ours), we use 128 sequences from the allenai/c4 dataset as calibration data.
Hardware Specification | No | The paper mentions "NVIDIA Ampere GPUs and newer" in Section E.2 but does not specify exact GPU models (e.g., A100, RTX 3090), CPU models, or memory details, which is not specific enough to reproduce the experimental hardware environment.
Software Dependencies | No | The paper mentions using 'torch.Tensor' and 'torch.sparse.to_sparse_semi_structured' in Sections E.1 and E.2, implying the use of PyTorch, but it does not provide specific version numbers for any software libraries or dependencies.
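For illustration, the 2:4 ("semi-structured") sparsity pattern that `torch.sparse.to_sparse_semi_structured` accelerates on Ampere-class GPUs can be sketched framework-free in NumPy: in every group of four consecutive weights along a row, keep the two with largest magnitude. This is a minimal sketch of the pattern only (the helper name `mask_2_4` is mine), not of the PyTorch conversion itself.

```python
import numpy as np

def mask_2_4(W):
    """Build a 2:4 semi-structured sparsity mask for a 2-D weight matrix:
    within each group of four consecutive entries along a row, keep the
    two with largest magnitude. Assumes the column count divides by 4."""
    rows, cols = W.shape
    assert cols % 4 == 0, "2:4 sparsity needs column count divisible by 4"
    groups = np.abs(W).reshape(rows, cols // 4, 4)
    # indices of the two largest-magnitude entries in each group of four
    top2 = np.argsort(groups, axis=-1)[..., 2:]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, top2, 1.0, axis=-1)
    return mask.reshape(rows, cols)
```

Applying such a mask yields exactly 50% sparsity with the regular structure that sparse tensor cores require, which is why the hardware note in the paper is limited to "NVIDIA Ampere GPUs and newer".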
Experiment Setup | Yes | It is important to note that our proposed iterative refinement method is entirely data-free and does not require calibration data, as shown in Algorithm 1. We consistently use T = 50 across all experiments, which is sufficient for achieving most of the potential error reduction while maintaining computational efficiency. When implementing Wanda pruning (Sun et al., 2023) and our method combined with Wanda (Wanda + Ours), we use 128 sequences from the allenai/c4 dataset as calibration data. For evaluation, we use 128 sequences from the WikiText-2 dataset for perplexity evaluation. The target rank k is consistently set to 128 for all low-rank refinement methods and sparsity levels.