Efficient Knowledge Injection in LLMs via Self-Distillation

Authors: Kalle Kujanpää, Pekka Marttinen, Harri Valpola, Alexander Ilin

TMLR 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental. We conduct extensive evaluations with the Llama-3 (Dubey et al., 2024) and Qwen2.5 (Yang et al., 2024) model families on custom datasets derived from SquadShifts (Miller et al., 2020) and the multi-hop HotpotQA benchmark (Yang et al., 2018). Our findings show that prompt distillation significantly surpasses standard supervised fine-tuning for knowledge injection and reasoning.
Researcher Affiliation — Collaboration. Kalle Kujanpää (EMAIL) and Pekka Marttinen (EMAIL): Department of Computer Science, Aalto University, and Finnish Center for Artificial Intelligence (FCAI). Harri Valpola (EMAIL) and Alexander Ilin (EMAIL): System 2 AI.
Pseudocode — No. The paper describes the prompt distillation approach and the data generation and distillation steps with mathematical equations for the loss function, but it does not include a clearly labeled pseudocode or algorithm block.
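Since the paper presents the distillation objective only as equations, a schematic sketch may help. It assumes prompt distillation is implemented as a per-answer-token forward KL between the teacher (which sees the source document in its prompt) and the student (which sees only the question); all function names here are illustrative, not the authors' code.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def prompt_distillation_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token KL over the answer tokens.

    teacher_logits_seq: logits from the base model conditioned on
        (document + question) — the "prompted" teacher.
    student_logits_seq: logits from the adapter-tuned model conditioned
        on the question alone.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token)
```

Identical teacher and student logits give zero loss; any divergence over the answer tokens yields a positive penalty that gradient descent pushes into the student's adapter weights.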
Open Source Code — Yes. Code available at https://github.com/kallekku/prompt-distillation
Open Datasets — Yes. We conduct extensive evaluations with the Llama-3 (Dubey et al., 2024) and Qwen2.5 (Yang et al., 2024) model families on custom datasets derived from SquadShifts (Miller et al., 2020) and the multi-hop HotpotQA benchmark (Yang et al., 2018).
Dataset Splits — Yes. The test set includes 1,000 questions from each SquadShifts variant: Wikipedia, New York Times articles, Reddit posts, and Amazon product reviews. The number of passages used corresponds to the documents for the first 1,000 questions, ranging from 188 (NYT) to 209 (Reddit) (see Table 1). We perform experiments on the four individual subsets separately. To ensure a valid evaluation, test questions must probe knowledge not already known to the base model. To test this, we evaluate the performances of the base models on the test questions (see the base model results in Table 2). We use the first 1,000 questions from the validation set of the HotpotQA distractor setting for our experiments.
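The split construction described above (first 1,000 questions, plus the de-duplicated passages those questions reference) can be sketched as follows. The `question`/`context` dict format is an assumption modeled on SQuAD-style JSON, not the authors' actual loader.

```python
def build_eval_split(examples, n_questions=1000):
    """Select the first n questions and the unique passages they cite.

    examples: list of dicts with 'question' and 'context' keys
        (hypothetical SQuAD-style record format).
    Returns (questions, passages), with passages de-duplicated in
    order of first appearance — which is why the passage count
    (e.g. 188-209 for SquadShifts) is smaller than the question count.
    """
    subset = examples[:n_questions]
    passages, seen = [], set()
    for ex in subset:
        if ex["context"] not in seen:
            seen.add(ex["context"])
            passages.append(ex["context"])
    return [ex["question"] for ex in subset], passages
```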
Hardware Specification — Yes. We fine-tune the 8B model on one AMD MI250X GPU for 24 hours (≈10 epochs). The 3B model is trained on one GPU and the 14B model on 8 GPUs for five epochs.
Software Dependencies — No. The paper mentions models like Llama-3 and Qwen2.5 and methods like LoRA and the AdamW optimizer, but it does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch version, Hugging Face Transformers version).
Experiment Setup — Yes. The student model uses a LoRA adapter, with rank 1024 for the 3B and 8B models and 512 for the 14B model, applied to all layers. We train all models using AdamW with a learning rate of 10⁻⁵, linear LR warmup, and a batch size of 4 per GPU. We fine-tune the 8B model on one AMD MI250X GPU for 24 hours (≈10 epochs). The 3B model is trained on one GPU and the 14B model on 8 GPUs for five epochs. In initial experiments, we exclude regularization due to its added computational cost. At test time, we present each test question individually to the fine-tuned model, sampling an answer with a temperature of 0.25. For the complete set of hyperparameters for prompt distillation, please see Table 8 in Appendix D.
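The reported hyperparameters can be collected into a small config sketch with the "linear LR warmup" made explicit. The field names are illustrative, and `warmup_steps` is an assumed placeholder — the paper defers the full hyperparameter list to Table 8 in Appendix D.

```python
from dataclasses import dataclass


@dataclass
class PromptDistillationConfig:
    """Values from the experiment-setup description; names are ours."""
    lora_rank: int = 1024            # 3B and 8B models; 512 for the 14B model
    learning_rate: float = 1e-5      # AdamW base LR
    batch_size_per_gpu: int = 4
    warmup_steps: int = 100          # assumed; paper only says "linear LR warmup"
    sampling_temperature: float = 0.25  # test-time answer sampling


def lr_at_step(cfg: PromptDistillationConfig, step: int) -> float:
    """Linear warmup to the base LR, then constant (assumed schedule)."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate
```

In practice these values would be passed to a LoRA/trainer setup (e.g. a PEFT adapter config and an AdamW optimizer with a warmup scheduler); the dataclass above only makes the paper's stated numbers concrete and checkable.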