Generalized Interpolating Discrete Diffusion

Authors: Dimitri Von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the practical side, in Sections 4 and 5 we apply our theory to the special case of masking noise in combination with varying levels of uniform noise. We conduct an ablation study, showing that our mask-only model achieves compute-matched state-of-the-art on diffusion language modeling thanks to a reweighted training objective (Sec. 5.2). We also show that the addition of uniform noise leads to improved sample quality and unlocks self-correction abilities (Fig. 1, Tab. 1) that allow the model to iteratively improve samples beyond what is possible by simply traversing the backward diffusion process (Sec. 5.4).
Researcher Affiliation | Academia | ¹Data Analytics Lab, Department of Computer Science, ETH Zurich; ²ELLIS Institute Tübingen; ³Max Planck Institute for Intelligent Systems, Tübingen. Correspondence to: Dimitri von Rütte <EMAIL>.
Pseudocode | Yes | A pseudocode implementation is given in Algorithm 1.
Open Source Code | Yes | Code: https://github.com/dvruette/gidd/
Open Datasets | Yes | To this end, we adopt the OpenWebText (OWT) dataset (Gokaslan et al., 2019) since there exists a rich literature for both autoregressive and diffusion models trained on this dataset.
Dataset Splits | Yes | For computing validation metrics, we reserve the last 100k samples (~1.25%) of the training set (OpenWebText). Validation samples that are longer than the context length are cropped to a random window for consistency with training: for sequences longer than 512 tokens we select a random window of 512 tokens, while shorter sequences are padded to a length of 512.
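The random-window cropping and padding described above can be sketched as follows; this is a hypothetical helper for illustration (`crop_or_pad` and `pad_id` are assumptions, not taken from the paper's code):

```python
import random

def crop_or_pad(tokens, context_len=512, pad_id=0):
    """Crop long sequences to a random 512-token window; pad short ones to 512."""
    if len(tokens) > context_len:
        # Pick a random start so every window of the sequence can be sampled.
        start = random.randint(0, len(tokens) - context_len)
        return tokens[start:start + context_len]
    # Short sequences are right-padded to the full context length.
    return tokens + [pad_id] * (context_len - len(tokens))
```

Cropping to a random window rather than always truncating from the front keeps the validation distribution consistent with how long documents are seen during training.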
Hardware Specification | Yes | All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens) on a single node of 8 NVIDIA A100/H100-80GB GPUs in bfloat16 precision using PyTorch's mixed precision training (torch.cuda.autocast).
Software Dependencies | No | All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens) on a single node of 8 NVIDIA A100/H100-80GB GPUs in bfloat16 precision using PyTorch's mixed precision training (torch.cuda.autocast). For optimization, we use the Adam optimizer (Kingma & Ba, 2017).
Experiment Setup | Yes | All our models are based on the DiT architecture (Peebles & Xie, 2023) and use the GPT2 tokenizer (Radford et al., 2019). We train models of three different sizes: TINY (L = 6, H = 8, d = 512; 28.4M non-emb. params.), SMALL (L = 12, H = 12, d = 768; 92.1M non-emb. params.), and BASE (L = 24, H = 16, d = 1024; 321.2M non-emb. params.), where L denotes the number of layers, H the number of attention heads, and d the dimensionality of hidden states. All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens)... For optimization, we use the Adam optimizer (Kingma & Ba, 2017) with β = (0.9, 0.99), ϵ = 10⁻⁹, and a learning rate of 5·10⁻⁴. The learning rate is warmed up linearly for the first 10k steps and then decayed using a cosine schedule to 10% of the initial learning rate. We use weight decay 0.0 for our ablations (unless stated otherwise) and 0.02 for the final configuration, also referred to as GIDD+. We also use gradient clipping to a norm of 1.0.
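The learning-rate schedule in the quoted setup (linear warmup for 10k steps, then cosine decay to 10% of the peak) can be sketched as a standalone function. This is a minimal illustration assuming the stated hyperparameters; `lr_at_step` and its argument names are hypothetical, not from the authors' code:

```python
import math

def lr_at_step(step, peak_lr=5e-4, warmup=10_000, total=500_000, final_frac=0.1):
    """Linear warmup to peak_lr over `warmup` steps, then cosine decay
    to final_frac * peak_lr at step `total`."""
    if step < warmup:
        # Linear ramp from 0 to peak_lr.
        return peak_lr * step / warmup
    # Cosine factor goes from 1 (at end of warmup) to 0 (at `total`).
    progress = (step - warmup) / (total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
```

At step 10k this returns the full peak rate of 5·10⁻⁴, and at step 500k it has decayed to 5·10⁻⁵, i.e. 10% of the initial learning rate as described.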