Instruction-Following Pruning for Large Language Models

Authors: Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, Tao Lei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model."
Researcher Affiliation | Collaboration | "Work done while interning at Apple. 1 Apple AI/ML, 2 UC Santa Barbara. Correspondence to: Bairu Hou <EMAIL>, Tao Lei <EMAIL>."
Pseudocode | No | The paper describes the architecture and training method in text and equations (Sections 3.1, 3.2, and 3.3) but does not provide a formal pseudocode or algorithm block.
Open Source Code | No | The paper states: "We use the AXLearn (Apple, 2023) framework and JAX (Bradbury et al., 2018) for model training." The URL provided (https://github.com/apple/axlearn) is for the AXLearn framework, not for the specific implementation of IFPRUNING described in this paper. There is no explicit statement or link indicating that the source code for IFPRUNING is made available.
Open Datasets | Yes | "We also follow the setup used in TULU 2 and sample an additional 800K examples from the FLAN-V2 collection (Chung et al., 2024) to enhance task prompt diversity. ... We include IFEval (Zhou et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and Arena-Hard-Auto (Li et al., 2024) for evaluation. We evaluate the pass@1 performance on HumanEval-Python (Chen et al., 2021), MBPP (Austin et al., 2021), and MultiPL-E (Cassano et al., 2022)."
Dataset Splits | No | "Our models are trained on an internal SFT dataset with several million examples. We also follow the setup used in TULU 2 and sample an additional 800K examples from the FLAN-V2 collection (Chung et al., 2024) to enhance task prompt diversity." The paper describes the sources of the training and evaluation data but does not specify explicit training/validation/test splits for the internal dataset, explain how the sampled FLAN-V2 data was partitioned, or state whether standard splits were used for the evaluation benchmarks.
Hardware Specification | Yes | "We evaluate the latency on a single NVIDIA RTX A6000 GPU and report time-to-first-token (TTFT) and decoding time with input length = 4k, generation length = 100, and sample 4 responses for each query."
Software Dependencies | Yes | "We use the AXLearn (Apple, 2023) framework and JAX (Bradbury et al., 2018) for model training." The citation for JAX explicitly states "JAX: composable transformations of Python+NumPy programs, v0.3.13, 2018."
Experiment Setup | Yes | "All models are pretrained with a batch size of 2048 and a total number of 5T tokens, except that the DENSE-3B is trained for 9T tokens. The SFT training for the baselines and our method is performed with a batch size of 1024 for 60k training steps."
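The reported setup numbers can be sanity-checked with simple arithmetic. The sketch below derives the implied pretraining step count and the number of SFT examples consumed; the batch sizes, token budget, and step count come from the paper, while the 4096-token context length is an assumption made here purely for illustration (the excerpt does not state it).

```python
# Back-of-the-envelope checks on the reported training setup.
# From the paper: pretraining batch size 2048, 5T pretraining tokens,
# SFT batch size 1024, 60k SFT steps.
# ASSUMPTION (not stated in the excerpt): a 4096-token context length.
pretrain_batch = 2048
seq_len = 4096                   # assumed context length
total_tokens = 5e12              # 5T pretraining tokens

tokens_per_step = pretrain_batch * seq_len      # 8,388,608 tokens per step
pretrain_steps = total_tokens / tokens_per_step

sft_batch = 1024
sft_steps = 60_000
sft_examples = sft_batch * sft_steps            # examples consumed during SFT

print(f"~{pretrain_steps:,.0f} pretraining steps")  # ~596,046 under the assumed seq_len
print(f"{sft_examples:,} SFT examples seen")        # 61,440,000
```

Under these assumptions, 5T tokens correspond to roughly 600k pretraining steps, and 60k SFT steps at batch size 1024 imply about 61M example presentations, consistent with an SFT corpus of several million examples seen over multiple epochs.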