The Diffusion Duality
Authors: Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, Volodymyr Kuleshov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Duo on standard language modeling benchmarks, training on LM1B (Chelba et al., 2014) and OpenWebText (OWT) (Gokaslan et al., 2019) with sequence packing (Raffel et al., 2020). We train our models for 1M steps with a batch size of 512 on both datasets. For LM1B, we use a context length of 128 with the bert-base-uncased tokenizer (Devlin et al., 2018) with sequence packing (Arriola et al., 2025; Austin et al., 2021) and without it (Sahoo et al., 2024a; Lou et al., 2023; He et al., 2022). For OWT, we use a context length of 1024 with the GPT-2 tokenizer (Radford et al., 2019). Following Sahoo et al. (2024a), we reserve the last 100K documents for validation. Our model is a 170M-parameter modified diffusion transformer (DiT) (Peebles & Xie, 2023) with rotary positional encoding (Su et al., 2023) and adaptive layer norm for conditioning on diffusion time, consistent with prior work (Lou et al., 2023; Sahoo et al., 2024a). Training is conducted on 8 H100s with bfloat16 precision. We train Duo using (17), which requires computing the integral in (10). To reduce computation overhead, we pre-compute and cache 100K (αt, T(αt)) tuples, a cache significantly smaller than the denoising network. The Gaussian diffusion parameter αt is parameterized using a linear schedule, i.e., αt = 1 − t, t ∈ [0, 1]. |
| Researcher Affiliation | Collaboration | (1) Computer and Information Science, Cornell Tech, NY, USA; (2) School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland; (3) Cohere, NY, USA. |
| Pseudocode | Yes | Algorithm 1 Discrete Consistency Distillation (DCD) ... Algorithm 2 Discrete Consistency Distillation (DCD) with EMA as teacher |
| Open Source Code | Yes | We provide the code and model checkpoints on the project page: https://s-sahoo.com/duo |
| Open Datasets | Yes | We evaluate Duo on standard language modeling benchmarks, training on LM1B (Chelba et al., 2014) and OpenWebText (OWT) (Gokaslan et al., 2019) with sequence packing (Raffel et al., 2020)... our zero-shot datasets include the validation splits of Penn Treebank (PTB; Marcus et al. (1993)), WikiText (Merity et al., 2016), LM1B, Lambada (Paperno et al., 2016), AG News (Zhang et al., 2015), and scientific papers from arXiv and PubMed (Cohan et al., 2018). |
| Dataset Splits | Yes | For LM1B, we use a context length of 128... For OWT, we use a context length of 1024... Following Sahoo et al. (2024a), we reserve the last 100K documents for validation... our zero-shot datasets include the validation splits of Penn Treebank (PTB; Marcus et al. (1993)), WikiText (Merity et al., 2016), LM1B, Lambada (Paperno et al., 2016), AG News (Zhang et al., 2015), and scientific papers from arXiv and PubMed (Cohan et al., 2018). We tokenize OpenWebText with the GPT-2 tokenizer. We concatenate and wrap the documents to a length of 1,024. When wrapping, we add the EOS token in between concatenated sequences. Since OpenWebText does not have a validation split, we leave the last 100K docs as validation. |
| Hardware Specification | Yes | Training is conducted on 8 H100s with bfloat16 precision. |
| Software Dependencies | No | The paper mentions software components like "bert-base-uncased tokenizer", "GPT-2 tokenizer", "AdamW optimizer", and uses "bfloat16 precision", but does not specify exact version numbers for any programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or libraries used in the implementation. |
| Experiment Setup | Yes | We train our models for 1M steps with a batch size of 512 on both datasets. For LM1B, we use a context length of 128... For OWT, we use a context length of 1024... Our model is a 170M-parameter modified diffusion transformer (DiT)... with rotary positional encoding... and adaptive layer norm... τ = 0.001 for the first 500K steps... with β = 0.03 and γ = 0.15... and τ = 0 for the remaining steps up to 1M... The Gaussian diffusion parameter αt is parameterized using a linear schedule... We run N = 5 distillation rounds, starting with discretization step δ = 1/512... and doubling it every M = 10K steps... We use 12 layers, a hidden dimension of 768, 12 attention heads, and a timestep embedding of 128... We use the AdamW optimizer with a batch size of 512, with a linear learning rate warmup from 0 to 3e-4 over 2,500 steps. We use a constant learning rate for 1M, 5M, or 10M steps on One Billion Words, and 1M steps for OpenWebText. We use a dropout rate of 0.1. |
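As an illustrative aid (not taken from the paper's code release), the optimizer schedule quoted above, a linear warmup from 0 to 3e-4 over 2,500 steps followed by a constant rate, can be sketched as a plain function; the function name is ours:

```python
def lr_at_step(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2500) -> float:
    """Linear warmup from 0 to peak_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

For example, `lr_at_step(1250)` gives 1.5e-4 (halfway through warmup); from step 2,500 onward the rate stays at 3e-4 for the rest of training.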
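Similarly, the distillation schedule quoted above (N = 5 rounds, starting at discretization step δ = 1/512 and doubling every M = 10K steps) can be sketched as follows; mapping training steps to rounds this way is our assumption, not a detail confirmed by the quoted text:

```python
def discretization_step(step: int, delta0: float = 1 / 512,
                        double_every: int = 10_000, num_rounds: int = 5) -> float:
    """Return delta for the current training step: delta0 in round 0,
    doubled at each subsequent round, capped at the final round."""
    round_idx = min(step // double_every, num_rounds - 1)
    return delta0 * (2 ** round_idx)
```

Under this reading, δ starts at 1/512 and reaches 1/32 in the fifth and final round.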