Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
Authors: Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the effectiveness of the ULD loss step by step. First, we report in Tab. 2 the aggregated key-metric performance over the different datasets and teacher/student pairs. ULD loss achieves the best overall results, indicating that it effectively improves the performance of every student model on a variety of downstream tasks with any teacher. Notably, ULD loss exhibits an average improvement of 2.30 points over models trained on teacher-generated text for extractive QA tasks, and Bloomz outperforms its teacher Mistral on the QED dataset. |
| Researcher Affiliation | Collaboration | Nicolas Boizard (Diabolocom, Paris, France; MICS, CentraleSupélec, Paris-Saclay University, France); Kevin El Haddad (Diabolocom, Paris, France; ISIA Lab, University of Mons, Belgium); Céline Hudelot (MICS, CentraleSupélec, Paris-Saclay University, France); Pierre Colombo (Equall.ai; MICS, CentraleSupélec, Paris-Saclay University, France) |
| Pseudocode | No | The paper describes the Universal Logit Distillation (ULD) loss using mathematical equations (Eq. 4, Eq. 5) and explains its formulation and computation. Appendix A provides a detailed proof of a closed-form solution. However, it does not present a structured pseudocode block or algorithm outlining the procedural steps in a code-like format. |
| Open Source Code | Yes | 3. Contributing to future research. We make our code, model weights, and generated datasets openly available to facilitate future research, minimizing computational overhead and lowering entry barriers. Code: https://github.com/Nicolas-BZRD/llm-recipes |
| Open Datasets | Yes | SQuAD (Ext.): The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) is a reading comprehension dataset with 87,600 questions generated by crowdworkers from Wikipedia articles. QED (Ext.): The QED dataset (Lamm et al., 2020), expertly annotated, extends a subset of the Google Natural Questions dataset, comprising 7,640 question-answering pairs with explanations. FairytaleQA (Gen.): The FairytaleQA dataset (Xu et al., 2022), created by educational experts, consists of 10,580 questions from 278 child-friendly stories. PubMedQA (Gen.): The PubMedQA dataset (Jin et al., 2019) contains question-answer pairs extracted from medical papers. DIALOGSum (Sum.): DIALOGSum (Chen et al., 2021) is a large-scale dialogue summarization dataset, consisting of 13,460 spoken dialogues with corresponding summaries and topics. |
| Dataset Splits | No | We opted to retain original answers for the test split. We investigated various scenarios to evaluate the ULD loss performance across different datasets and tasks. These comprised two extractive QA (Ext.), two generative QA (Gen.), and one summarization (Sum.) task. Evaluations are performed over the respective test splits. |
| Hardware Specification | Yes | Finally, distillation was performed in BFLOAT16 mode introduced by Kalamkar et al. (2019), on 4*NVIDIA A100-SXM4-80GB with the Fully Sharded Data Parallel (FSDP) technique (Zhao et al., 2023b). |
| Software Dependencies | No | The paper mentions using the 'POT library (Flamary et al., 2021)' and working in 'BFLOAT16 mode introduced by Kalamkar et al. (2019)'. However, it does not provide specific version numbers for these software components or any other key libraries/frameworks (e.g., PyTorch version) needed for reproducibility. |
| Experiment Setup | Yes | During training, all distillation processes were performed over 5 epochs with a batch size of 8 for the SQuAD dataset and 4 for the others. A one-cycle learning rate scheduler was used with the following configuration for decoder models: max_lr = 1e-6, initial_lr = max_lr/2, min_lr = initial_lr/5. For mt0 (encoder-decoder), the max learning rate parameter varied according to datasets: DIALOGSum: 1e-4, PubMedQA: 3e-4, and QED: 7e-6. |
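Since the paper describes the ULD loss only through equations (Eq. 4, Eq. 5) and a closed-form proof in Appendix A, the core idea can be illustrated with a minimal sketch: a closed-form 1D Wasserstein distance between the sorted probability distributions of teacher and student, which sidesteps vocabulary alignment entirely. The function name, the zero-padding choice, and the softmax details below are our assumptions for illustration, not the authors' implementation:

```python
import math

def uld_wasserstein(student_logits, teacher_logits):
    """Hedged sketch of the Wasserstein term of the ULD loss.

    Each model's probabilities are sorted in decreasing order, the
    smaller vocabulary is zero-padded, and the distance is the sum of
    absolute differences between the sorted distributions (the
    closed-form solution of 1D optimal transport).
    """
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    # Sort each distribution in decreasing order.
    p_s = sorted(softmax(student_logits), reverse=True)
    p_t = sorted(softmax(teacher_logits), reverse=True)

    # Zero-pad the smaller vocabulary so both vectors have equal length.
    n = max(len(p_s), len(p_t))
    p_s += [0.0] * (n - len(p_s))
    p_t += [0.0] * (n - len(p_t))

    # Closed-form 1D Wasserstein distance: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(p_s, p_t))
```

Because only the sorted probability mass is compared, the two models may have entirely different tokenizers and vocabulary sizes, which is what makes the loss "universal" across teacher/student pairs.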
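The one-cycle configuration in the setup row (max_lr = 1e-6, initial_lr = max_lr/2, min_lr = initial_lr/5 for decoder models) can be sketched as a simple schedule function. The linear ramps and the warmup fraction are our assumptions; the paper only specifies the three learning-rate anchors:

```python
def one_cycle_lr(step, total_steps, max_lr=1e-6, warmup_frac=0.3):
    """Hedged sketch of the one-cycle schedule described in the setup.

    initial_lr = max_lr / 2 and min_lr = initial_lr / 5 follow the
    paper; the warmup fraction and the linear ramp shapes are
    illustrative assumptions.
    """
    initial_lr = max_lr / 2
    min_lr = initial_lr / 5
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from initial_lr up to max_lr.
        t = step / max(warmup_steps, 1)
        return initial_lr + t * (max_lr - initial_lr)
    # Linear decay from max_lr down to min_lr.
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max_lr + t * (min_lr - max_lr)
```

For mt0, only max_lr would change per dataset (e.g. 1e-4 for DIALOGSum), with the same initial_lr and min_lr ratios.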