Beyond Autoregression: Fast LLMs via Self-Distillation Through Time

Authors: Justin Deschenaux, Caglar Gulcehre

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, at the 1.3B-parameter scale, diffusion models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV-caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.
Researcher Affiliation Academia Justin Deschenaux, Caglar Gulcehre School of Computer and Communication Sciences CLAIRE, EPFL Lausanne, Switzerland EMAIL
Pseudocode Yes Algorithm 1 Computing the Self-Distillation Through Time targets
Open Source Code No Additionally, upon de-anonymization, we will release our code and artifacts.
Open Datasets Yes We distill MDLMs on the Open Web Text dataset (Gokaslan & Cohen, 2019) as it was used to train recent discrete diffusion language models (Lou et al., 2023; Sahoo et al., 2024). ... We evaluate the distilled students on LAMBADA (Paperno et al., 2016) and 6 multiple-choice questions benchmarks from Gao et al. (2021).
Dataset Splits No The paper mentions distilling on the Open Web Text dataset and using specific samples for MAUVE evaluation, but does not provide explicit training/validation/test splits for the model training or distillation process.
Hardware Specification Yes We compute the latency using untrained models with around 1.3B parameters, using the same hyperparameters as Deschenaux & Gulcehre (2024). We use a batch size of 8 and time the sampling 10 times after one warm-up step on a single A100 GPU with 80 GiB of RAM.
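The latency protocol quoted above (one warm-up step, then 10 timed sampling runs) can be sketched as a small helper. This is a minimal illustration, not the authors' benchmarking code; `sample_fn` is a hypothetical zero-argument callable that draws one batch, and on a GPU one would additionally need to synchronize (e.g. `torch.cuda.synchronize()`) around each measurement.

```python
import time

def time_sampling(sample_fn, n_warmup=1, n_runs=10):
    """Run `sample_fn` for `n_warmup` untimed warm-up steps, then time
    it `n_runs` times and return the per-run wall-clock durations."""
    for _ in range(n_warmup):
        sample_fn()  # warm-up: triggers compilation, cache allocation, etc.
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        sample_fn()
        timings.append(time.perf_counter() - start)
    return timings
```

Reporting the median (or minimum) of the returned timings is a common way to reduce noise from OS scheduling.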
Software Dependencies No The paper mentions "Flash Attention (Dao et al., 2022)" and building on "the open source model of Sahoo et al. (2024)", but does not provide specific version numbers for these or other software dependencies.
Experiment Setup Yes We use the Adam optimizer with a learning rate of 6e-5, a batch size of 128 and no weight decay. We linearly increase the learning rate for 500 training steps and keep it constant afterwards. As a base model, we reuse the checkpoint released by Sahoo et al. (2024). ... We apply iterated SDTT for 7 rounds of 10k training iterations and generate x^teacher_θ(z_t, t, m/k) with 2 sampling steps from the teacher (algorithm 1). We use an exponential moving average (EMA) of the weights with a decay of 0.9999 that we do not reset between rounds.
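The two scheduling details in this row (linear warm-up for 500 steps then a constant learning rate, and a weight EMA with decay 0.9999 that persists across rounds) can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: names are hypothetical, and weights are shown as plain floats where real training code would operate on tensors.

```python
def lr_schedule(step, base_lr=6e-5, warmup=500):
    """Linear warm-up to `base_lr` over `warmup` steps, constant afterwards."""
    return base_lr * min(1.0, step / warmup)

class EMA:
    """Exponential moving average of model weights.

    Decay 0.9999 matches the paper; the same EMA object would be kept
    across iterated SDTT rounds, since the paper does not reset it.
    """
    def __init__(self, weights, decay=0.9999):
        self.decay = decay
        self.shadow = list(weights)  # running average, initialized to the weights

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w
                       for s, w in zip(self.shadow, weights)]
```

With decay 0.9999, each update moves the shadow weights only 0.01% of the way toward the current weights, so the average reflects roughly the last ~10k steps, on the order of one SDTT round.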