Instruction-Following Pruning for Large Language Models
Authors: Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, Tao Lei
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model. |
| Researcher Affiliation | Collaboration | Work done while interning at Apple. 1Apple AI/ML 2UC Santa Barbara. Correspondence to: Bairu Hou <EMAIL>, Tao Lei <tao EMAIL>. |
| Pseudocode | No | The paper describes the architecture and training method in text and equations (Section 3.1, 3.2, 3.3) but does not provide a formal pseudocode or algorithm block. |
| Open Source Code | No | The paper states: We use the AXLearn (Apple, 2023) framework and JAX (Bradbury et al., 2018) for model training. The URL provided (https://github.com/apple/axlearn) is for the AXLearn framework, not for the specific implementation of IFPRUNING described in this paper. There is no explicit statement or link indicating that the source code for IFPRUNING is made available. |
| Open Datasets | Yes | We also follow the setup used in TULU 2 and sample additional 800K examples from the FLAN-V2 collection (Chung et al., 2024) to enhance task prompt diversity. ... We include IFEval (Zhou et al., 2023), Alpaca Eval 2.0 (Dubois et al., 2024), and Arena-Hard-Auto (Li et al., 2024) for evaluation. We evaluate the pass@1 performance on HumanEval-python (Chen et al., 2021), MBPP (Austin et al., 2021), and MultiPL-E (Cassano et al., 2022). |
| Dataset Splits | No | Our models are trained on an internal SFT dataset with several million examples. We also follow the setup used in TULU 2 and sample additional 800K examples from the FLAN-V2 collection (Chung et al., 2024) to enhance task prompt diversity. The paper describes the sources of the training and evaluation data, but it does not specify explicit training/validation/test splits for the internal dataset, explain how the sampled FLAN-V2 data was partitioned, or state that standard splits were used for the other evaluation benchmarks. |
| Hardware Specification | Yes | We evaluate the latency on a single NVIDIA RTX A6000 GPU and report time-to-first-token (TTFT) and decoding time with input length = 4k, generation length = 100, and sample 4 responses for each query. |
| Software Dependencies | Yes | We use the AXLearn (Apple, 2023) framework and JAX (Bradbury et al., 2018) for model training. The citation for JAX explicitly states 'JAX: composable transformations of Python+NumPy programs, v0.3.13, 2018'. |
| Experiment Setup | Yes | All models are pretrained with a batch size of 2048 and a total number of 5T tokens, except that the DENSE-3B is trained for 9T tokens. The SFT training for the baselines and our method is performed with a batch size of 1024 for 60k training steps. |