Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
Authors: Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the effectiveness of the ULD loss step by step. First, we report in Tab. 2 the aggregated key-metric performance over the different datasets and teacher/student pairs. ULD loss achieves the best overall results, indicating that it effectively improves the performance of every student model on a variety of downstream tasks with any teacher. Notably, ULD loss exhibits an average improvement of 2.30 points over models trained on teacher-generated text for extractive QA tasks, and Bloomz outperforms its teacher Mistral on the QED dataset. |
| Researcher Affiliation | Collaboration | Nicolas Boizard (Diabolocom, Paris, France; MICS, CentraleSupélec, Paris-Saclay University, France); Kevin El Haddad (Diabolocom, Paris, France; ISIA Lab, University of Mons, Belgium); Céline Hudelot (MICS, CentraleSupélec, Paris-Saclay University, France); Pierre Colombo (Equall.ai; MICS, CentraleSupélec, Paris-Saclay University, France) |
| Pseudocode | No | The paper describes the Universal Logit Distillation (ULD) loss using mathematical equations (Eq. 4, Eq. 5) and explains its formulation and computation. Appendix A provides a detailed proof of a closed-form solution. However, it does not present a structured pseudocode block or algorithm outlining the procedural steps in a code-like format. |
| Open Source Code | Yes | 3. Contributing to future research. We make our code, model weights, and generated datasets openly available to facilitate future research, minimizing computational overhead and lowering entry barriers. Code: https://github.com/Nicolas-BZRD/llm-recipes |
| Open Datasets | Yes | SQuAD (Ext.): The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) is a reading comprehension dataset with 87,600 questions generated by crowdworkers from Wikipedia articles. QED (Ext.): The QED dataset (Lamm et al., 2020), expertly annotated, extends a subset of the Google Natural Questions dataset, comprising 7,640 question-answering pairs with explanations. FairytaleQA (Gen.): The FairytaleQA dataset (Xu et al., 2022), created by educational experts, consists of 10,580 questions from 278 child-friendly stories. PubMedQA (Gen.): The PubMedQA dataset (Jin et al., 2019) contains question-answer pairs extracted from medical papers. DIALOGSum (Sum.): DIALOGSum (Chen et al., 2021) is a large-scale dialogue summarization dataset, consisting of 13,460 spoken dialogues with corresponding summaries and topics. |
| Dataset Splits | No | We opted to retain original answers for the test split. We investigated various scenarios to evaluate the ULD loss performance across different datasets and tasks. These comprised two extractive QA (Ext.), two generative QA (Gen.), and one summarization (Sum.) task. Evaluations are performed over the respective test splits. |
| Hardware Specification | Yes | Finally, distillation was performed in BFLOAT16 mode introduced by Kalamkar et al. (2019), on 4*NVIDIA A100-SXM4-80GB with the Fully Sharded Data Parallel (FSDP) technique (Zhao et al., 2023b). |
| Software Dependencies | No | The paper mentions using the 'POT library (Flamary et al., 2021)' and working in 'BFLOAT16 mode introduced by Kalamkar et al. (2019)'. However, it does not provide specific version numbers for these software components or any other key libraries/frameworks (e.g., PyTorch version) needed for reproducibility. |
| Experiment Setup | Yes | During training, all distillation processes were performed over 5 epochs with a batch size of 8 for the SQuAD dataset and 4 for the others. A one-cycle learning rate scheduler was used with the following configuration for decoder models: max_lr = 1e-6, initial_lr = max_lr/2, min_lr = initial_lr/5. For mt0 (encoder-decoder), the max learning rate parameter varied according to datasets: DIALOGSum: 1e-4, PubMedQA: 3e-4, and QED: 7e-6. |
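Since the paper describes the ULD loss only through equations (Eq. 4, Eq. 5) and a closed-form proof in Appendix A, the core idea can be illustrated with a minimal sketch: a closed-form 1D Wasserstein distance between the sorted probability distributions of teacher and student, which sidesteps vocabulary alignment entirely. The function name, the zero-padding choice, and the softmax details below are our assumptions for illustration, not the authors' implementation:

```python
import math

def uld_wasserstein(student_logits, teacher_logits):
    """Hedged sketch of the Wasserstein term of the ULD loss.

    Each model's probabilities are sorted in decreasing order, the
    smaller vocabulary is zero-padded, and the distance is the sum of
    absolute differences between the sorted distributions (the
    closed-form solution of 1D optimal transport).
    """
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    # Sort each distribution in decreasing order.
    p_s = sorted(softmax(student_logits), reverse=True)
    p_t = sorted(softmax(teacher_logits), reverse=True)

    # Zero-pad the smaller vocabulary so both vectors have equal length.
    n = max(len(p_s), len(p_t))
    p_s += [0.0] * (n - len(p_s))
    p_t += [0.0] * (n - len(p_t))

    # Closed-form 1D Wasserstein distance: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(p_s, p_t))
```

Because only the sorted probability mass is compared, the two models may have entirely different tokenizers and vocabulary sizes, which is what makes the loss "universal" across teacher/student pairs.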
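The one-cycle configuration in the setup row (max_lr = 1e-6, initial_lr = max_lr/2, min_lr = initial_lr/5 for decoder models) can be sketched as a simple schedule function. The linear ramps and the warmup fraction are our assumptions; the paper only specifies the three learning-rate anchors:

```python
def one_cycle_lr(step, total_steps, max_lr=1e-6, warmup_frac=0.3):
    """Hedged sketch of the one-cycle schedule described in the setup.

    initial_lr = max_lr / 2 and min_lr = initial_lr / 5 follow the
    paper; the warmup fraction and the linear ramp shapes are
    illustrative assumptions.
    """
    initial_lr = max_lr / 2
    min_lr = initial_lr / 5
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from initial_lr up to max_lr.
        t = step / max(warmup_steps, 1)
        return initial_lr + t * (max_lr - initial_lr)
    # Linear decay from max_lr down to min_lr.
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max_lr + t * (min_lr - max_lr)
```

For mt0, only max_lr would change per dataset (e.g. 1e-4 for DIALOGSum), with the same initial_lr and min_lr ratios.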