Simple and Effective Masked Diffusion Language Models
Authors: Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, Volodymyr Kuleshov
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. |
| Researcher Affiliation | Academia | Subham Sekhar Sahoo (Cornell Tech, NYC, USA); Marianne Arriola (Cornell Tech, NYC, USA); Yair Schiff (Cornell Tech, NYC, USA); Aaron Gokaslan (Cornell Tech, NYC, USA); Edgar Marroquin (Cornell Tech, NYC, USA); Justin T. Chiu (Cornell Tech, NYC, USA); Alexander Rush (Cornell Tech, NYC, USA); Volodymyr Kuleshov (Cornell Tech, NYC, USA) |
| Pseudocode | Yes | Algorithm 1 (Training MDLM): 1: repeat; 2: x^{1:L} ~ q(x) (sample a sentence); 3: t ~ U[0,1] (sample a time step); 4: z_t^ℓ ~ Cat(z_t^ℓ; α_t x^ℓ + (1−α_t) m) for 1 ≤ ℓ ≤ L (mask each token x^ℓ independently to obtain the latent z_t^{1:L}); 5: take a gradient descent step on L_NELBO = E_q[∫ (α'_t / (1−α_t)) Σ_ℓ log⟨x_θ^ℓ(z_t^{1:L}, t), x^ℓ⟩ dt]; 6: until converged. (A PyTorch sketch of this training step appears after the table.) |
| Open Source Code | Yes | We provide the code (https://github.com/kuleshov-group/mdlm), along with a blog post and video tutorial, on the project page: https://s-sahoo.com/mdlm |
| Open Datasets | Yes | For language modeling likelihood evaluation, we conduct experiments on two datasets: The One Billion Words Dataset (LM1B; [8]) and Open Web Text (OWT; [18]). |
| Dataset Splits | Yes | Since Open Web Text does not have a validation split, we leave the last 100k docs as validation. (A split sketch appears after the table.) |
| Hardware Specification | Yes | We conduct all experiments on 8x 3090s, 8x A6000s, 8x A100s, or 8x H100s. The largest models on Open Web Text take 2 weeks to train on 8x A100; the LM1B models take only 2 days on the same hardware. |
| Software Dependencies | No | The paper mentions 'bert-base-uncased tokenizer' and 'GPT2 tokenizer [45]' but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | We use 12 layers, a hidden dimension of 768, 12 attention heads, and a timestep embedding of 128 when applicable. Word embeddings are not tied between the input and output. We use the AdamW optimizer with a batch size of 512 and a learning rate warmup from 0 to 3e-4 over the first 2,500 steps, followed by a constant learning rate for 1M, 5M, or 10M steps on One Billion Words and 1M steps on Open Web Text. We use a dropout rate of 0.1. (An optimizer-schedule sketch appears after the table.) |
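
To make the Pseudocode row concrete, here is a minimal PyTorch sketch of one MDLM training step under the log-linear schedule α_t = 1 − t, for which the NELBO weight α'_t / (1 − α_t) reduces to −1/t. The names `model`, `batch`, and `mask_id` are placeholders, not the authors' implementation; `model(z_t, t)` is assumed to return per-token logits over the vocabulary.

```python
import torch
import torch.nn.functional as F

def mdlm_training_step(model, batch, mask_id, optimizer):
    """One gradient step on a Monte Carlo estimate of the MDLM NELBO."""
    x = batch                                   # clean token ids, shape (B, L)
    B, L = x.shape

    # 1) Sample a time step t ~ U[0, 1] per sequence.
    t = torch.rand(B, 1, device=x.device)

    # 2) Log-linear schedule: alpha_t = 1 - t, so each token is masked
    #    independently with probability 1 - alpha_t = t.
    alpha_t = 1.0 - t
    mask = torch.rand(B, L, device=x.device) < (1.0 - alpha_t)
    z_t = torch.where(mask, torch.full_like(x, mask_id), x)

    # 3) Denoiser predicts the clean tokens from the partially masked sequence.
    logits = model(z_t, t)                                        # (B, L, V)
    log_p = torch.gather(
        F.log_softmax(logits, dim=-1), -1, x.unsqueeze(-1)
    ).squeeze(-1)                                                 # (B, L)

    # 4) NELBO estimate: only masked positions contribute, weighted by 1/t.
    loss = ((mask.float() * -log_p).sum(dim=-1) / t.squeeze(-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```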
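The Dataset Splits row can be reproduced along these lines. The HuggingFace `datasets` identifier `Skylion007/openwebtext` and the `select`-based split are tooling assumptions, not details taken from the paper; only the "last 100k documents as validation" rule comes from the text.

```python
from datasets import load_dataset

# OpenWebText ships as a single training split (~8M documents).
owt = load_dataset("Skylion007/openwebtext", split="train")
n_val = 100_000

# Hold out the last 100k documents as the validation set.
train_docs = owt.select(range(len(owt) - n_val))
val_docs = owt.select(range(len(owt) - n_val, len(owt)))
```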
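The optimizer and warmup from the Experiment Setup row can be expressed roughly as follows, assuming standard PyTorch components; `model` here is only a stand-in for the 12-layer, 768-dimensional denoiser.

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the actual denoiser network

lr = 3e-4
warmup_steps = 2_500

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # Linear warmup from 0 to the base learning rate over 2,500 steps,
    # then hold the learning rate constant for the rest of training.
    lr_lambda=lambda step: min(1.0, step / warmup_steps),
)
```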