The Diffusion Duality
Authors: Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, Volodymyr Kuleshov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Duo on standard language modeling benchmarks, training on LM1B (Chelba et al., 2014) and OpenWebText (OWT) (Gokaslan et al., 2019) with sequence packing (Raffel et al., 2020). We train our models for 1M steps with a batch size of 512 on both datasets. For LM1B, we use a context length of 128 with the bert-base-uncased tokenizer (Devlin et al., 2018) with sequence packing (Arriola et al., 2025; Austin et al., 2021) and without it (Sahoo et al., 2024a; Lou et al., 2023; He et al., 2022). For OWT, we use a context length of 1024 with the GPT-2 tokenizer (Radford et al., 2019). Following Sahoo et al. (2024a), we reserve the last 100K documents for validation. Our model is a 170M-parameter modified diffusion transformer (DiT) (Peebles & Xie, 2023) with rotary positional encoding (Su et al., 2023) and adaptive layer norm for conditioning on diffusion time, consistent with prior work (Lou et al., 2023; Sahoo et al., 2024a). Training is conducted on 8 H100s with bfloat16 precision. We train Duo using (17), which requires computing the integral in (10). To reduce computation overhead, we pre-compute and cache 100K (αt, T(αt)) tuples, a cache significantly smaller than the denoising network. The Gaussian diffusion parameter αt is parameterized using a linear schedule, i.e., αt = 1 − t, t ∈ [0, 1]. |
| Researcher Affiliation | Collaboration | (1) Computer and Information Science, Cornell Tech, NY, USA; (2) School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland; (3) Cohere, NY, USA. |
| Pseudocode | Yes | Algorithm 1 Discrete Consistency Distillation (DCD) ... Algorithm 2 Discrete Consistency Distillation (DCD) with EMA as teacher |
| Open Source Code | Yes | We provide the code and model checkpoints on the project page: https://s-sahoo.com/duo |
| Open Datasets | Yes | We evaluate Duo on standard language modeling benchmarks, training on LM1B (Chelba et al., 2014) and OpenWebText (OWT) (Gokaslan et al., 2019) with sequence packing (Raffel et al., 2020)... our zero-shot datasets include the validation splits of Penn Treebank (PTB; Marcus et al. (1993)), WikiText (Merity et al., 2016), LM1B, Lambada (Paperno et al., 2016), AG News (Zhang et al., 2015), and scientific papers from arXiv and PubMed (Cohan et al., 2018). |
| Dataset Splits | Yes | For LM1B, we use a context length of 128... For OWT, we use a context length of 1024... Following Sahoo et al. (2024a), we reserve the last 100K documents for validation... our zero-shot datasets include the validation splits of Penn Treebank (PTB; Marcus et al. (1993)), WikiText (Merity et al., 2016), LM1B, Lambada (Paperno et al., 2016), AG News (Zhang et al., 2015), and scientific papers from arXiv and PubMed (Cohan et al., 2018). We tokenize OpenWebText with the GPT-2 tokenizer. We concatenate and wrap the documents to a length of 1,024. When wrapping, we add the EOS token in between concatenated sequences. Since OpenWebText does not have a validation split, we leave the last 100K docs as validation. |
| Hardware Specification | Yes | Training is conducted on 8 H100s with bfloat16 precision. |
| Software Dependencies | No | The paper mentions software components like "bert-base-uncased tokenizer", "GPT-2 tokenizer", "AdamW optimizer", and uses "bfloat16 precision", but does not specify exact version numbers for any programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or libraries used in the implementation. |
| Experiment Setup | Yes | We train our models for 1M steps with a batch size of 512 on both datasets. For LM1B, we use a context length of 128... For OWT, we use a context length of 1024... Our model is a 170M-parameter modified diffusion transformer (DiT)... with rotary positional encoding... and adaptive layer norm... τ = 0.001 for the first 500K steps... with β = 0.03 and γ = 0.15... and τ = 0 for the remaining steps up to 1M... The Gaussian diffusion parameter αt is parameterized using a linear schedule... We run N = 5 distillation rounds, starting with discretization step δ = 1/512... and doubling it every M = 10K steps... We use 12 layers, a hidden dimension of 768, 12 attention heads, and a timestep embedding of 128... We use the AdamW optimizer with a batch size of 512, with a linear learning rate warmup from 0 to 3e-4 over 2,500 steps. We use a constant learning rate for 1M, 5M, or 10M steps on One Billion Words, and 1M steps for OpenWebText. We use a dropout rate of 0.1. |
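As an illustrative aid (not taken from the paper's code release), the optimizer schedule quoted above, a linear warmup from 0 to 3e-4 over 2,500 steps followed by a constant rate, can be sketched as a plain function; the function name is ours:

```python
def lr_at_step(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2500) -> float:
    """Linear warmup from 0 to peak_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

For example, `lr_at_step(1250)` gives 1.5e-4 (halfway through warmup); from step 2,500 onward the rate stays at 3e-4 for the rest of training.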
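Similarly, the distillation schedule quoted above (N = 5 rounds, starting at discretization step δ = 1/512 and doubling every M = 10K steps) can be sketched as follows; mapping training steps to rounds this way is our assumption, not a detail confirmed by the quoted text:

```python
def discretization_step(step: int, delta0: float = 1 / 512,
                        double_every: int = 10_000, num_rounds: int = 5) -> float:
    """Return delta for the current training step: delta0 in round 0,
    doubled at each subsequent round, capped at the final round."""
    round_idx = min(step // double_every, num_rounds - 1)
    return delta0 * (2 ** round_idx)
```

Under this reading, δ starts at 1/512 and reaches 1/32 in the fifth and final round.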