Efficient Knowledge Injection in LLMs via Self-Distillation
Authors: Kalle Kujanpää, Pekka Marttinen, Harri Valpola, Alexander Ilin
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive evaluations with the Llama-3 (Dubey et al., 2024) and Qwen2.5 (Yang et al., 2024) model families on custom datasets derived from Squadshifts (Miller et al., 2020) and the multi-hop Hotpot QA benchmark (Yang et al., 2018). Our findings show that prompt distillation significantly surpasses standard supervised fine-tuning for knowledge injection and reasoning. |
| Researcher Affiliation | Collaboration | Kalle Kujanpää (EMAIL) and Pekka Marttinen (EMAIL): Department of Computer Science, Aalto University; Finnish Center for Artificial Intelligence (FCAI). Harri Valpola (EMAIL) and Alexander Ilin (EMAIL): System 2 AI. |
| Pseudocode | No | The paper describes the prompt distillation approach and the data generation and distillation steps with mathematical equations for the loss function, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code available at https://github.com/kallekku/prompt-distillation |
| Open Datasets | Yes | We conduct extensive evaluations with the Llama-3 (Dubey et al., 2024) and Qwen2.5 (Yang et al., 2024) model families on custom datasets derived from Squadshifts (Miller et al., 2020) and the multi-hop Hotpot QA benchmark (Yang et al., 2018). |
| Dataset Splits | Yes | The test set includes 1,000 questions from each Squadshifts variant: Wikipedia, New York Times articles, Reddit posts, and Amazon product reviews. The number of passages used corresponds to the documents for the first 1,000 questions, ranging from 188 (NYT) to 209 (Reddit) (see Table 1). We perform experiments on the four individual subsets separately. To ensure a valid evaluation, test questions must probe knowledge not already known to the base model. To test this, we evaluate the performances of the base models on the test questions (see the base model results in Table 2). We use the first 1,000 questions from the validation set of the Hotpot QA distractor setting for our experiments. |
| Hardware Specification | Yes | We fine-tune the 8B model on one AMD MI250X GPU for 24 hours (~10 epochs). The 3B model is trained on one GPU and the 14B model on 8 GPUs for five epochs. |
| Software Dependencies | No | The paper mentions models like Llama-3 and Qwen2.5, the LoRA fine-tuning method, and the AdamW optimizer, but it does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch version, Hugging Face Transformers version). |
| Experiment Setup | Yes | The student model uses a LoRA adapter, with rank 1024 for the 3B and 8B models and 512 for the 14B model, applied to all layers. We train all models using AdamW with a learning rate of 10⁻⁵, linear LR warmup, and a batch size of 4 per GPU. We fine-tune the 8B model on one AMD MI250X GPU for 24 hours (~10 epochs). The 3B model is trained on one GPU and the 14B model on 8 GPUs for five epochs. In initial experiments, we exclude regularization due to its added computational cost. At test time, we present each test question individually to the fine-tuned model, sampling an answer with a temperature of 0.25. For the complete set of hyperparameters for prompt distillation, please see Table 8 in Appendix D. |
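The paper states that prompt distillation is defined by equations for its loss function rather than pseudocode. As a minimal sketch of what such a distillation objective could look like, the snippet below implements a forward-KL loss in which a student (without the passage in its prompt) matches a teacher's next-token distribution (computed with the passage in the prompt). The function name, tensor shapes, and choice of KL divergence are assumptions for illustration, not taken from the paper or its repository.

```python
import torch
import torch.nn.functional as F

def prompt_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) over next-token distributions.

    Both logit tensors have shape (batch, seq_len, vocab_size). In the
    prompt-distillation setting, teacher_logits would come from the model
    conditioned on the source passage, and student_logits from the
    LoRA-adapted model without the passage in its context.
    """
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities of the student as input and,
    # with log_target=True, log-probabilities of the teacher as target.
    return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean")

# Toy usage with random logits standing in for model outputs.
torch.manual_seed(0)
teacher = torch.randn(4, 16, 128)
student = torch.randn(4, 16, 128)
loss = prompt_distillation_loss(student, teacher)
print(loss.item())
```

In a real training loop, this loss would replace (or complement) the standard cross-entropy used in supervised fine-tuning, with only the LoRA adapter parameters receiving gradients.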