Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
Authors: Justin Deschenaux, Caglar Gulcehre
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, at the 1.3B parameters scale, diffusion models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV-caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters. |
| Researcher Affiliation | Academia | Justin Deschenaux, Caglar Gulcehre, School of Computer and Communication Sciences, CLAIRE, EPFL, Lausanne, Switzerland, EMAIL |
| Pseudocode | Yes | Algorithm 1 Computing the Self-Distillation Through Time targets |
| Open Source Code | No | Additionally, upon de-anonymization, we will release our code and artifacts. |
| Open Datasets | Yes | We distill MDLMs on the Open Web Text dataset (Gokaslan & Cohen, 2019) as it was used to train recent discrete diffusion language models (Lou et al., 2023; Sahoo et al., 2024). ... We evaluate the distilled students on LAMBADA (Paperno et al., 2016) and 6 multiple-choice questions benchmarks from Gao et al. (2021). |
| Dataset Splits | No | The paper mentions distilling on the Open Web Text dataset and using specific samples for MAUVE evaluation, but does not provide explicit training/validation/test splits for the model training or distillation process. |
| Hardware Specification | Yes | We compute the latency using untrained models with around 1.3B parameters, using the same hyperparameters as Deschenaux & Gulcehre (2024). We use a batch size of 8 and time the sampling 10 times after one warm-up step on a single A100 GPU with 80 GiB of RAM. |
| Software Dependencies | No | The paper mentions "Flash Attention (Dao et al., 2022)" and building on "the open source model of Sahoo et al. (2024)", but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use the Adam optimizer with a learning rate of 6e-5, a batch size of 128 and no weight decay. We linearly increase the learning rate for 500 training steps and keep it constant afterwards. As a base model, we reuse the checkpoint released by Sahoo et al. (2024). ... We apply iterated SDTT for 7 rounds of 10k training iterations and generate x_θ^teacher(z_t, t, m/k) with 2 sampling steps from the teacher (algorithm 1). We use an exponential moving average (EMA) of the weights with a decay of 0.9999 that we do not reset between rounds. |
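The quoted training setup (linear warmup to 6e-5 over 500 steps then a constant rate, plus an EMA of the weights with decay 0.9999) can be sketched as follows. This is a hypothetical illustration of the quoted hyperparameters, not the authors' released implementation; function names and structure are assumptions.

```python
def lr_at_step(step, peak_lr=6e-5, warmup_steps=500):
    """Learning rate schedule as quoted: linear increase for 500 steps,
    constant afterwards."""
    if step < warmup_steps:
        # Linear ramp from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def ema_update(ema_params, params, decay=0.9999):
    """One EMA step over flat parameter lists, with the quoted decay of
    0.9999. Per the paper, the EMA is not reset between SDTT rounds."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

For example, `lr_at_step(0)` returns 1.2e-7 and `lr_at_step(499)` reaches the full 6e-5, which then stays constant for the rest of training.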