Generalized Interpolating Discrete Diffusion
Authors: Dimitri Von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the practical side, in Sections 4 and 5 we apply our theory to the special case of masking noise in combination with varying levels of uniform noise. We conduct an ablation study, showing that our mask-only model achieves compute-matched state-of-the-art on diffusion language modeling thanks to a reweighted training objective (Sec. 5.2). We also show that the addition of uniform noise leads to improved sample quality and unlocks self-correction abilities (Fig. 1, Tab. 1) that allow the model to iteratively improve samples beyond what is possible by simply traversing the backward diffusion process (Sec. 5.4). |
| Researcher Affiliation | Academia | 1Data Analytics Lab, Department of Computer Science, ETH Zurich 2ELLIS Institute Tübingen 3Max Planck Institute for Intelligent Systems, Tübingen. Correspondence to: Dimitri von Rütte <EMAIL>. |
| Pseudocode | Yes | A pseudocode implementation is given in Algorithm 1. |
| Open Source Code | Yes | Code: https://github.com/dvruette/gidd/ |
| Open Datasets | Yes | To this end, we adopt the Open Web Text (OWT) dataset (Gokaslan et al., 2019) since there exists a rich literature for both autoregressive and diffusion models trained on this dataset. |
| Dataset Splits | Yes | For computing validation metrics, we reserve the last 100k samples (~1.25%) of the training set (Open Web Text). Validation samples that are longer than the context length are cropped to a random window for consistency with training. For sequences longer than 512 tokens we select a random window of 512 tokens, while short sequences are padded to a length of 512. |
| Hardware Specification | Yes | All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens) on a single node of 8 NVIDIA A100/H100-80GB GPUs in bfloat16 precision using PyTorch's mixed precision training (torch.cuda.autocast). |
| Software Dependencies | No | All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens) on a single node of 8 NVIDIA A100/H100-80GB GPUs in bfloat16 precision using PyTorch's mixed precision training (torch.cuda.autocast). For optimization, we use the Adam optimizer (Kingma & Ba, 2017). |
| Experiment Setup | Yes | All our models are based on the DiT architecture (Peebles & Xie, 2023) and use the GPT2 tokenizer (Radford et al., 2019). We train models of three different sizes: TINY (L = 6, H = 8, d = 512; 28.4M non-emb. params.), SMALL (L = 12, H = 12, d = 768; 92.1M non-emb. params.), and BASE (L = 24, H = 16, d = 1024; 321.2M non-emb. params.), where L denotes the number of layers, H the number of attention heads, and d the dimensionality of hidden states. All models are trained with a context size of 512 tokens and batch size of 512 for 500k steps (resulting in a total of 131B training tokens)... For optimization, we use the Adam optimizer (Kingma & Ba, 2017) with β = (0.9, 0.99), ϵ = 10⁻⁹, and a learning rate of 5·10⁻⁴. The learning rate is warmed up linearly for the first 10k steps and then decayed using a cosine schedule to 10% of the initial learning rate. We use weight decay 0.0 for our ablations (unless stated otherwise) and 0.02 for the final configuration, also referred to as GIDD+. We also use gradient clipping to a norm of 1.0. |
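The validation preprocessing quoted in the Dataset Splits row (crop long sequences to a random 512-token window, pad short ones to 512) can be sketched as below. This is a minimal illustration, not the paper's code; the function name `prepare_validation_sample` and the `pad_id` parameter are assumptions.

```python
import random

def prepare_validation_sample(tokens, context_len=512, pad_id=0):
    """Crop to a random window of `context_len` tokens, or pad to that length.

    Mirrors the split description: sequences longer than the context length
    get a random 512-token window (consistent with training); shorter ones
    are padded up to 512. `pad_id` is a hypothetical padding-token id.
    """
    if len(tokens) > context_len:
        # Pick a uniformly random start so the window fits inside the sequence.
        start = random.randint(0, len(tokens) - context_len)
        return tokens[start:start + context_len]
    # Pad short sequences on the right to the full context length.
    return tokens + [pad_id] * (context_len - len(tokens))
```

In practice this would run inside the data-loading pipeline so that validation batches have a uniform shape of 512 tokens.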
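The reported learning-rate schedule (linear warmup over the first 10k steps, then cosine decay over the remaining steps to 10% of the peak) can be written as a step-dependent multiplier on the base rate of 5·10⁻⁴. A minimal sketch, assuming the warmup is linear from zero and the decay is a standard half-cosine; `lr_lambda` is a hypothetical helper of the kind that could be passed to PyTorch's `torch.optim.lr_scheduler.LambdaLR`:

```python
import math

def lr_lambda(step, warmup=10_000, total=500_000, final_frac=0.10):
    """Learning-rate multiplier: linear warmup, then cosine decay to 10% of peak.

    Matches the setup described above: 10k warmup steps, 500k total steps,
    decay target of 10% of the initial learning rate. The multiplier is
    applied to the base rate (5e-4 with Adam, beta=(0.9, 0.99), eps=1e-9).
    """
    if step < warmup:
        # Linear ramp from 0 to 1 over the warmup phase.
        return step / warmup
    # Cosine decay from 1 down to `final_frac` over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return final_frac + (1.0 - final_frac) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The multiplier reaches 1.0 exactly at the end of warmup and 0.10 at step 500k, matching the stated decay to 10% of the initial learning rate.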