Think while You Generate: Discrete Diffusion with Planned Denoising

Authors: Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, Rafael Gómez-Bombarelli

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8 and OpenWebText, and on token-based generation on ImageNet 256×256. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. ... In experiments on GPT-2 scale language modeling and 256×256 image token generation, DDPD significantly outperforms its mask diffusion counterparts when using the same denoiser.
Researcher Affiliation | Collaboration | Sulin Liu¹, Juno Nam¹, Andrew Campbell², Hannes Stärk¹, Yilun Xu³, Tommi Jaakkola¹, Rafael Gómez-Bombarelli¹ — ¹Massachusetts Institute of Technology, ²University of Oxford, ³NVIDIA Research
Pseudocode | Yes | The pseudo-algorithm for our proposed sampling method is presented in Algorithm 1.
Open Source Code | Yes | Code is available at github.com/liusulin/DDPD.
Open Datasets | Yes | DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8 and OpenWebText, and on token-based generation on ImageNet 256×256. ... We represent images with discrete-valued tokens using a pre-trained tokenizer and decoder from Yu et al. [45].
Dataset Splits | No | The paper uses well-known datasets (text8, OpenWebText, ImageNet) that typically come with predefined splits, but it does not explicitly state the training/validation/test splits used in its experiments (e.g., percentages, sample counts, or a direct citation for the split methodology). For text8, it notes that "Our experimental setup follows that of [7]", which may imply the same splits, but this is not spelled out in the paper itself.
Hardware Specification | Yes | We trained our models on four A100 80GB GPUs, and it takes around 100 hours to finish training for 750k iterations. ... We trained our planner models on nodes with four A100 80GB GPUs for 400k iterations.
Software Dependencies | No | The paper mentions "PyTorch pseudocode" and the use of the AdamW [28] optimizer. However, it does not give version numbers for PyTorch or any other software libraries or dependencies, which a reproducible description would require.
Experiment Setup | Yes | For all models, we used an effective batch size of 2048 with micro-batch 512 accumulated every 4 steps. For optimization, we used AdamW [28] with a weight decay factor of 0.1. The learning rate was linearly warmed up to 1e-4 over 1000 steps, and decayed using a cosine schedule to 1e-5 at 1M steps. We used a total training step budget of 750k steps. ... We used AdamW [28] with a weight decay factor of 0, and the learning rate was linearly warmed up to 3e-4 over the first 2500 steps and then held constant. EMA with a decay factor of 0.999 was applied to the model parameters. ... The planner is trained with batch size 2048 for 400k iterations ... AdamW [28] optimizer with a weight decay factor of 0.03, β1 = 0.9, and β2 = 0.96, and a learning rate of 2e-4. The learning rate schedule included a linear warmup over the first 10k steps, followed by cosine annealing down to a final learning rate of 1e-5. EMA was applied with a decay factor of 0.999.
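The learning-rate schedule quoted above (linear warmup to 1e-4 over 1000 steps, then cosine decay to 1e-5 at 1M steps) can be sketched as a standalone function. This is a minimal illustration of the schedule as described in the report, not the authors' actual training code; the constant names are hypothetical.

```python
import math

# Values taken from the quoted experiment setup; names are illustrative.
WARMUP_STEPS = 1_000
TOTAL_STEPS = 1_000_000
PEAK_LR = 1e-4
FINAL_LR = 1e-5

def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine annealing from PEAK_LR down to FINAL_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```

In PyTorch this shape is commonly expressed with `torch.optim.lr_scheduler.LambdaLR` or a warmup wrapper around `CosineAnnealingLR`; the closed-form function above makes the schedule easy to check at specific steps.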