Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Authors: Marianne Arriola, Aaron Gokaslan, Justin Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sahoo, Volodymyr Kuleshov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate BD3-LMs across standard language modeling benchmarks and demonstrate their ability to generate arbitrary-length sequences unconditionally. We pre-train a base BD3-LM using the maximum block size L′ = L for 850K gradient steps and fine-tune under varying L′ for 150K gradient steps on the One Billion Words dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)). |
| Researcher Affiliation | Collaboration | Correspondence to Marianne Arriola: EMAIL. Affiliations: Cornell Tech, NY, USA; Stanford University, CA, USA; Cohere, NY, USA. |
| Pseudocode | Yes | Algorithm 1 Block Diffusion Training Algorithm 2 Block Diffusion Sampling |
| Open Source Code | Yes | We provide the code, along with the model weights and a blog post, on the project page: https://m-arriola.com/bd3lms (code: https://github.com/kuleshov-group/bd3lms). |
| Open Datasets | Yes | We conduct experiments on two datasets: The One Billion Word Dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)). |
| Dataset Splits | Yes | Models trained on LM1B use the bert-base-uncased tokenizer and a context length of 128. We report perplexities on the test split of LM1B. Models trained on OWT use the GPT2 tokenizer (Radford et al., 2019) and a context length of 1024. Since OWT does not have a validation split, we leave the last 100K documents for validation. |
| Hardware Specification | Yes | We use 3090, A5000, A6000, and A100 GPUs. |
| Software Dependencies | Yes | FlexAttention (Dong et al., 2024) is a compiler-driven programming model that enables efficient implementation of attention mechanisms with structured sparsity in PyTorch... significantly less memory with up to 5X speedup over the native scaled_dot_product_attention implementation in PyTorch (v2.5) on an A5000 GPU |
| Experiment Setup | Yes | We use the AdamW optimizer with a batch size of 512 and a constant learning rate of 3e-4, warmed up linearly from 0 over 2.5K gradient updates. We train a base BD3-LM using the maximum block size L′ = L for 850K gradient steps. Then, we fine-tune under varying L′ using the noise schedule optimization for 150K gradient steps on the One Billion Words dataset (LM1B) and OpenWebText (OWT). |
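The structured-sparsity attention pattern referenced in the Software Dependencies row can be sketched as a simple predicate over query/key positions. The snippet below is a hedged illustration, not the authors' implementation: it builds the block-causal mask in which a token in block b attends to every token in blocks 0..b (including later positions within its own block); `block_size` plays the role of the block length L′, and the function name is illustrative. FlexAttention accepts exactly this kind of position predicate and compiles it into a sparse kernel.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key k.

    Block-causal rule: attention is allowed whenever the key's block index
    is at or before the query's block index, so tokens see their own full
    block plus all preceding blocks.
    """
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

With `block_size` equal to the full sequence length this degenerates to a fully dense (bidirectional) mask, and with `block_size = 1` it becomes the standard causal mask, mirroring how block diffusion interpolates between diffusion and autoregressive modeling.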
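The learning-rate schedule in the Experiment Setup row (linear warmup from 0 to 3e-4 over 2.5K gradient updates, then constant) can be written as a small step-to-rate function. This is a minimal sketch under the stated hyperparameters; the function name `warmup_lr` is illustrative and does not come from the released code.

```python
PEAK_LR = 3e-4        # constant learning rate reached after warmup
WARMUP_STEPS = 2500   # 2.5K gradient updates of linear warmup

def warmup_lr(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR over WARMUP_STEPS, then constant."""
    if step >= WARMUP_STEPS:
        return PEAK_LR
    return PEAK_LR * step / WARMUP_STEPS
```

In PyTorch this shape of schedule is typically attached to `torch.optim.AdamW` via `torch.optim.lr_scheduler.LambdaLR`, with the lambda returning the multiplier `min(step / WARMUP_STEPS, 1.0)`.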