DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models
Authors: Changyi He, Yifu Ding, Jinyang Guo, Ruihao Gong, Haotong Qin, Xianglong Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD can outperform existing state-of-the-art KD methods by 2% with half the training cost and even surpass the teacher model with 4.7× compression. Table 1. Results of task-agnostic instruction following on Llama2 and Qwen2.5 models. We report the average ROUGE-L scores across five random seeds. |
| Researcher Affiliation | Collaboration | 1State Key Laboratory of Complex & Critical Software Environment, Beihang University 2School of Computer Science and Engineering, Beihang University 3School of Artificial Intelligence, Beihang University 4SenseTime Research 5ETH Zurich. |
| Pseudocode | Yes | Algorithm 1 Difficulty-aware data updating in our DA-KD |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | For instruction following experiments, we choose the databricks-dolly (Conover et al., 2023) dataset processed by Gu et al. (2024) for distillation. Then, we evaluate the trained student models on five instruction-following datasets: Dolly evaluation (Conover et al., 2023), Self-Instruct (Wang et al., 2022a), Super-Natural Instructions (Wang et al., 2022b), Unnatural Instructions (Honovich et al., 2022) and Vicuna evaluation (Chiang et al., 2023). For task-specific experiments, we consider two distinct tasks for evaluation: text summarization using SAMSum (Gliwa et al., 2019), and mathematical reasoning with GSM8K (Cobbe et al., 2021). |
| Dataset Splits | Yes | The paper distills on databricks-dolly (Conover et al., 2023) as processed by Gu et al. (2024) and evaluates on the benchmark datasets listed above (five instruction-following sets, SAMSum, and GSM8K). The use of these widely recognized benchmark datasets implies the use of their standard predefined splits. |
| Hardware Specification | Yes | All the test cases compress Llama2-7B into a 2.7B model using four NVIDIA A800 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and a cosine learning rate scheduler, but does not provide specific version numbers for any software libraries or frameworks such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We train all models for 10 epochs using a batch size of 8. We use the AdamW optimizer and a cosine learning rate scheduler. The initial learning rate is set as 1e-5. In our implementation, we set τ and λ as 0.1 and 0.9, respectively. |
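The reported setup (10 epochs, batch size 8, AdamW, cosine schedule, initial LR 1e-5) can be sketched as follows. This is a minimal, framework-agnostic illustration of the cosine learning rate decay named in the paper, not the authors' implementation; the step counts are hypothetical, and the paper does not state whether warmup or a nonzero final LR is used, so the sketch assumes a plain decay to zero.

```python
import math

# Hyperparameters as quoted from the paper's experiment setup.
INIT_LR = 1e-5   # initial learning rate
EPOCHS = 10      # training epochs
BATCH_SIZE = 8   # batch size

def cosine_lr(step: int, total_steps: int, init_lr: float = INIT_LR) -> float:
    """Standard cosine decay from init_lr at step 0 to ~0 at total_steps."""
    return init_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Example with a hypothetical 1000 optimizer steps:
total = 1000
print(cosine_lr(0, total))     # start of training: full initial LR (1e-5)
print(cosine_lr(500, total))   # midpoint: half the initial LR (5e-6)
print(cosine_lr(1000, total))  # end of training: decayed to ~0
```

In a PyTorch-based pipeline, the same schedule would typically be obtained by pairing `torch.optim.AdamW` with `torch.optim.lr_scheduler.CosineAnnealingLR`.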