Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Authors: Marianne Arriola, Aaron Gokaslan, Justin Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sahoo, Volodymyr Kuleshov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate BD3-LMs across standard language modeling benchmarks and demonstrate their ability to generate arbitrary-length sequences unconditionally. We pre-train a base BD3-LM using the maximum block size L′ = L for 850K gradient steps and fine-tune under varying L′ for 150K gradient steps on the One Billion Words dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)).
Researcher Affiliation | Collaboration | Correspondence to Marianne Arriola: EMAIL. Cornell Tech, NY, USA; Stanford University, CA, USA; Cohere, NY, USA.
Pseudocode | Yes | Algorithm 1: Block Diffusion Training; Algorithm 2: Block Diffusion Sampling
Open Source Code | Yes | We provide the code, along with the model weights and a blog post, on the project page: https://m-arriola.com/bd3lms (code: https://github.com/kuleshov-group/bd3lms)
Open Datasets | Yes | We conduct experiments on two datasets: the One Billion Word dataset (LM1B; Chelba et al. (2014)) and OpenWebText (OWT; Gokaslan et al. (2019)).
Dataset Splits | Yes | Models trained on LM1B use the bert-base-uncased tokenizer and a context length of 128; we report perplexities on the test split of LM1B. Models trained on OWT use the GPT-2 tokenizer (Radford et al., 2019) and a context length of 1024. Since OWT does not provide a validation split, we hold out the last 100K documents for validation.
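The OWT validation split described above (reserving the final 100K documents) can be sketched as a simple tail hold-out. This is a minimal illustration of the stated split rule; the function name `split_owt` and the in-memory list representation are assumptions for the sketch, not the authors' data pipeline.

```python
def split_owt(docs, n_val=100_000):
    """Hold out the last `n_val` documents for validation, keep the rest
    for training -- the split rule quoted in the report above."""
    if len(docs) <= n_val:
        raise ValueError("need more documents than the validation size")
    return docs[:-n_val], docs[-n_val:]
```

In practice the same rule would be applied to the ordered OWT document stream before tokenization, so the validation set is deterministic across runs.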
Hardware Specification | Yes | We use 3090, A5000, A6000, and A100 GPUs.
Software Dependencies | Yes | FlexAttention (Dong et al., 2024) is a compiler-driven programming model that enables efficient implementation of attention mechanisms with structured sparsity in PyTorch... significantly less memory with up to a 5X speedup over the native scaled_dot_product_attention implementation in PyTorch (v2.5) on an A5000 GPU
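The structured sparsity FlexAttention exploits here is a block-causal pattern: each token attends to every token in its own block and in all earlier blocks. A minimal sketch of that mask predicate, written as the kind of per-position function one would pass to FlexAttention as a `mask_mod` (the exact mask in the authors' code may differ):

```python
def block_causal_mask(q_idx, kv_idx, block_size):
    """Return True where query position q_idx may attend to key position
    kv_idx under a block-causal pattern: full attention within a block,
    causal attention across blocks. Sketch only; assumes fixed-size blocks."""
    return kv_idx // block_size <= q_idx // block_size
```

Because the mask depends only on block indices, the attention matrix is block-sparse, which is the structure a compiler-driven kernel can skip over rather than materialize.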
Experiment Setup | Yes | We use the AdamW optimizer with a batch size of 512 and a constant learning rate of 3e-4 with linear warmup from 0 over 2.5K gradient updates. We train a base BD3-LM using the maximum block size L′ = L for 850K gradient steps. Then, we fine-tune under varying L′ using the noise schedule optimization for 150K gradient steps on the One Billion Words dataset (LM1B) and OpenWebText (OWT).
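The learning-rate schedule quoted above (linear warmup from 0 to 3e-4 over 2.5K updates, then constant) can be sketched as a pure function of the step count. The function name `warmup_lr` is hypothetical; in a PyTorch training loop the same rule would typically be wrapped in a `torch.optim.lr_scheduler.LambdaLR` around the AdamW optimizer.

```python
def warmup_lr(step, peak_lr=3e-4, warmup_steps=2500):
    """Learning rate at a given gradient update: ramps linearly from 0
    to peak_lr over warmup_steps, then stays constant at peak_lr."""
    return peak_lr * min(1.0, step / warmup_steps)
```

With warmup this short relative to the 850K-step pre-training run, the schedule is effectively constant for almost all of training.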