DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models
Authors: Changyi He, Yifu Ding, Jinyang Guo, Ruihao Gong, Haotong Qin, Xianglong Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD can outperform existing state-of-the-art KD methods by 2% with half the training cost and even surpass the teacher model with 4.7× compression. Table 1. Results of task-agnostic instruction following on Llama2 and Qwen2.5 models. We report the average ROUGE-L scores across five random seeds. |
| Researcher Affiliation | Collaboration | 1State Key Laboratory of Complex & Critical Software Environment, Beihang University 2School of Computer Science and Engineering, Beihang University 3School of Artificial Intelligence, Beihang University 4SenseTime Research 5ETH Zurich. |
| Pseudocode | Yes | Algorithm 1 Difficulty-aware data updating in our DA-KD |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | For instruction following experiments, we choose the databricks-dolly (Conover et al., 2023) dataset processed by Gu et al. (2024) for distillation. Then, we evaluate the trained student models on five instruction-following datasets: Dolly evaluation (Conover et al., 2023), Self-Instruct (Wang et al., 2022a), Super-Natural Instructions (Wang et al., 2022b), Unnatural Instructions (Honovich et al., 2022) and Vicuna evaluation (Chiang et al., 2023). For task-specific experiments, we consider two distinct tasks for evaluation: text summarization using SAMSum (Gliwa et al., 2019), and mathematical reasoning with GSM8K (Cobbe et al., 2021). |
| Dataset Splits | Yes | The paper distills on databricks-dolly (Conover et al., 2023) as processed by Gu et al. (2024) and evaluates on the benchmark datasets listed above (five instruction-following sets, SAMSum, and GSM8K). The use of these widely recognized benchmark datasets implies the use of their standard predefined splits. |
| Hardware Specification | Yes | All the test cases compress Llama2-7B into a 2.7B model using four NVIDIA A800 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and a cosine learning rate scheduler, but does not provide specific version numbers for any software libraries or frameworks such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We train all models for 10 epochs using a batch size of 8. We use the AdamW optimizer and a cosine learning rate scheduler. The initial learning rate is set as 1e-5. In our implementation, we set τ and λ as 0.1 and 0.9, respectively. |
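The reported setup (10 epochs, batch size 8, AdamW, cosine schedule, initial LR 1e-5) can be sketched as follows. This is a minimal, framework-agnostic illustration of the cosine learning rate decay named in the paper, not the authors' implementation; the step counts are hypothetical, and the paper does not state whether warmup or a nonzero final LR is used, so the sketch assumes a plain decay to zero.

```python
import math

# Hyperparameters as quoted from the paper's experiment setup.
INIT_LR = 1e-5   # initial learning rate
EPOCHS = 10      # training epochs
BATCH_SIZE = 8   # batch size

def cosine_lr(step: int, total_steps: int, init_lr: float = INIT_LR) -> float:
    """Standard cosine decay from init_lr at step 0 to ~0 at total_steps."""
    return init_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Example with a hypothetical 1000 optimizer steps:
total = 1000
print(cosine_lr(0, total))     # start of training: full initial LR (1e-5)
print(cosine_lr(500, total))   # midpoint: half the initial LR (5e-6)
print(cosine_lr(1000, total))  # end of training: decayed to ~0
```

In a PyTorch-based pipeline, the same schedule would typically be obtained by pairing `torch.optim.AdamW` with `torch.optim.lr_scheduler.CosineAnnealingLR`.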