Beyond Autoregression: Fast LLMs via Self-Distillation Through Time

Authors: Justin Deschenaux, Caglar Gulcehre

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, at the 1.3B-parameter scale, diffusion models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV-caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.
Researcher Affiliation Academia Justin Deschenaux, Caglar Gulcehre School of Computer and Communication Sciences CLAIRE, EPFL Lausanne, Switzerland EMAIL
Pseudocode Yes Algorithm 1 Computing the Self-Distillation Through Time targets
Open Source Code No Additionally, upon de-anonymization, we will release our code and artifacts.
Open Datasets Yes We distill MDLMs on the Open Web Text dataset (Gokaslan & Cohen, 2019) as it was used to train recent discrete diffusion language models (Lou et al., 2023; Sahoo et al., 2024). ... We evaluate the distilled students on LAMBADA (Paperno et al., 2016) and 6 multiple-choice questions benchmarks from Gao et al. (2021).
Dataset Splits No The paper mentions distilling on the Open Web Text dataset and using specific samples for MAUVE evaluation, but does not provide explicit training/validation/test splits for the model training or distillation process.
Hardware Specification Yes We compute the latency using untrained models with around 1.3B parameters, using the same hyperparameters as Deschenaux & Gulcehre (2024). We use a batch size of 8 and time the sampling 10 times after one warm-up step on a single A100 GPU with 80 GiB of RAM.
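The latency protocol quoted above (one warm-up step, then 10 timed sampling runs) can be sketched as a small helper. This is a minimal illustration, not the authors' benchmarking code; `sample_fn` is a hypothetical zero-argument callable that draws one batch, and on a GPU one would additionally need to synchronize (e.g. `torch.cuda.synchronize()`) around each measurement.

```python
import time

def time_sampling(sample_fn, n_warmup=1, n_runs=10):
    """Run `sample_fn` for `n_warmup` untimed warm-up steps, then time
    it `n_runs` times and return the per-run wall-clock durations."""
    for _ in range(n_warmup):
        sample_fn()  # warm-up: triggers compilation, cache allocation, etc.
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        sample_fn()
        timings.append(time.perf_counter() - start)
    return timings
```

Reporting the median (or minimum) of the returned timings is a common way to reduce noise from OS scheduling.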
Software Dependencies No The paper mentions "Flash Attention (Dao et al., 2022)" and building on "the open source model of Sahoo et al. (2024)", but does not provide specific version numbers for these or other software dependencies.
Experiment Setup Yes We use the Adam optimizer with a learning rate of 6e-5, a batch size of 128 and no weight decay. We linearly increase the learning rate for 500 training steps and keep it constant afterwards. As a base model, we reuse the checkpoint released by Sahoo et al. (2024). ... We apply iterated SDTT for 7 rounds of 10k training iterations and generate x^teacher_θ(z_t, t, m/k) with 2 sampling steps from the teacher (algorithm 1). We use an exponential moving average (EMA) of the weights with a decay of 0.9999 that we do not reset between rounds.
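The two scheduling details in this row (linear warm-up for 500 steps then a constant learning rate, and a weight EMA with decay 0.9999 that persists across rounds) can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: names are hypothetical, and weights are shown as plain floats where real training code would operate on tensors.

```python
def lr_schedule(step, base_lr=6e-5, warmup=500):
    """Linear warm-up to `base_lr` over `warmup` steps, constant afterwards."""
    return base_lr * min(1.0, step / warmup)

class EMA:
    """Exponential moving average of model weights.

    Decay 0.9999 matches the paper; the same EMA object would be kept
    across iterated SDTT rounds, since the paper does not reset it.
    """
    def __init__(self, weights, decay=0.9999):
        self.decay = decay
        self.shadow = list(weights)  # running average, initialized to the weights

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w
                       for s, w in zip(self.shadow, weights)]
```

With decay 0.9999, each update moves the shadow weights only 0.01% of the way toward the current weights, so the average reflects roughly the last ~10k steps, on the order of one SDTT round.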