Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Authors: Marianne Arriola, Aaron Gokaslan, Justin Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sahoo, Volodymyr Kuleshov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate BD3-LMs across standard language modeling benchmarks and demonstrate their ability to generate arbitrary-length sequences unconditionally. We pre-train a base BD3-LM using the maximum block size L′ = L for 850K gradient steps and fine-tune under varying L′ for 150K gradient steps on the One Billion Words dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)).
Researcher Affiliation | Collaboration | Correspondence to Marianne Arriola: EMAIL. Cornell Tech, NY, USA; Stanford University, CA, USA; Cohere, NY, USA.
Pseudocode | Yes | Algorithm 1: Block Diffusion Training; Algorithm 2: Block Diffusion Sampling
Open Source Code | Yes | We provide the code, along with the model weights and a blog post, on the project page: https://m-arriola.com/bd3lms (code: https://github.com/kuleshov-group/bd3lms)
Open Datasets | Yes | We conduct experiments on two datasets: the One Billion Word dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)).
Dataset Splits | Yes | Models trained on LM1B use the bert-base-uncased tokenizer and a context length of 128; we report perplexities on the test split of LM1B. Models trained on OWT use the GPT-2 tokenizer (Radford et al., 2019) and a context length of 1024. Since OWT does not provide a validation split, we hold out the last 100K documents for validation.
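The OWT validation split described above (reserving the final 100K documents) can be sketched as a simple tail hold-out. This is a minimal illustration of the stated split rule; the function name `split_owt` and the in-memory list representation are assumptions for the sketch, not the authors' data pipeline.

```python
def split_owt(docs, n_val=100_000):
    """Hold out the last `n_val` documents for validation, keep the rest
    for training -- the split rule quoted in the report above."""
    if len(docs) <= n_val:
        raise ValueError("need more documents than the validation size")
    return docs[:-n_val], docs[-n_val:]
```

In practice the same rule would be applied to the ordered OWT document stream before tokenization, so the validation set is deterministic across runs.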
Hardware Specification | Yes | We use 3090, A5000, A6000, and A100 GPUs.
Software Dependencies | Yes | FlexAttention (Dong et al., 2024) is a compiler-driven programming model that enables efficient implementation of attention mechanisms with structured sparsity in PyTorch... significantly less memory with up to a 5X speedup over the native scaled_dot_product_attention implementation in PyTorch (v2.5) on an A5000 GPU
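The structured sparsity FlexAttention exploits here is a block-causal pattern: each token attends to every token in its own block and in all earlier blocks. A minimal sketch of that mask predicate, written as the kind of per-position function one would pass to FlexAttention as a `mask_mod` (the exact mask in the authors' code may differ):

```python
def block_causal_mask(q_idx, kv_idx, block_size):
    """Return True where query position q_idx may attend to key position
    kv_idx under a block-causal pattern: full attention within a block,
    causal attention across blocks. Sketch only; assumes fixed-size blocks."""
    return kv_idx // block_size <= q_idx // block_size
```

Because the mask depends only on block indices, the attention matrix is block-sparse, which is the structure a compiler-driven kernel can skip over rather than materialize.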
Experiment Setup | Yes | We use the AdamW optimizer with a batch size of 512 and a constant learning rate of 3e-4 with linear warmup from 0 over 2.5K gradient updates. We train a base BD3-LM using the maximum block size L′ = L for 850K gradient steps. Then, we fine-tune under varying L′ using the noise schedule optimization for 150K gradient steps on the One Billion Words dataset (LM1B) and OpenWebText (OWT).
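The learning-rate schedule quoted above (linear warmup from 0 to 3e-4 over 2.5K updates, then constant) can be sketched as a pure function of the step count. The function name `warmup_lr` is hypothetical; in a PyTorch training loop the same rule would typically be wrapped in a `torch.optim.lr_scheduler.LambdaLR` around the AdamW optimizer.

```python
def warmup_lr(step, peak_lr=3e-4, warmup_steps=2500):
    """Learning rate at a given gradient update: ramps linearly from 0
    to peak_lr over warmup_steps, then stays constant at peak_lr."""
    return peak_lr * min(1.0, step / warmup_steps)
```

With warmup this short relative to the 850K-step pre-training run, the schedule is effectively constant for almost all of training.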