Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Authors: Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, Lingpeng Kong

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
Researcher Affiliation Collaboration 1 The University of Hong Kong; 2 University of Illinois at Urbana-Champaign; 3 Apple; 4 Tencent AI Lab
Pseudocode Yes Algorithm 1 Adaptation Training; Algorithm 2 Sampling
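Algorithm 1 describes adaptation training under a masked (absorbing-state) diffusion objective. A minimal sketch of one such training step is below, assuming a `[MASK]` token, bidirectional attention, and the logits shifting mentioned in the experiment setup; all names here are illustrative, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def adaptation_training_step(model, input_ids, mask_token_id):
    """One hypothetical adaptation step: mask a random fraction t of
    tokens, run the (now bidirectional) model, and take a reweighted
    cross-entropy loss on the masked positions only."""
    batch, seq_len = input_ids.shape
    # Sample a masking ratio t ~ U(0, 1) per sequence.
    t = torch.rand(batch, 1)
    # Mask each token independently with probability t.
    is_masked = torch.rand(batch, seq_len) < t
    noisy_ids = torch.where(
        is_masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(noisy_ids)  # (batch, seq_len, vocab)
    # Logits shifting: an AR model's output at position i predicts token
    # i+1, so the prediction for token i comes from logits at i-1.
    # (Position 0 reuses its own logits as a placeholder here.)
    logits = torch.cat([logits[:, :1], logits[:, :-1]], dim=1)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), input_ids, reduction="none")
    # Standard absorbing-diffusion reweighting by 1/t on masked tokens.
    weighted = per_token * is_masked.float() / t.clamp(min=1e-3)
    return weighted.sum() / is_masked.sum().clamp(min=1)
```

At inference (Algorithm 2), the same model would start from an all-`[MASK]` sequence and iteratively unmask tokens over several denoising steps.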
Open Source Code Yes We release a suite of DLMs (127M-355M-7B) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions. https://github.com/HKUNLP/DiffuLLaMA
Open Datasets Yes We use the 30 billion token random split from the FineWeb dataset (Penedo et al., 2024), an improved corpus over OpenWebText (Gokaslan & Cohen, 2019) used in prior DLMs (Lou et al., 2024), to continue training GPT2 base (Radford et al., 2019). We continue pre-training LLaMA-2-7B (Touvron et al., 2023a) on a mixture of SlimPajama (70%) (Soboleva et al., 2023) and StarCoder (30%) (Li et al., 2023a) data following TinyLlama (Zhang et al., 2024a). We use TriviaQA (Joshi et al., 2017) to test the reading comprehension of models and the last-word completion task LAMBADA (Paperno et al., 2016) to test how models capture long-range dependencies in text. We also test the commonsense reasoning tasks HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), SIQA (Sap et al., 2019), and PIQA (Bisk et al., 2020), all of which involve multiple-choice questions assessed by accuracy. On the grade-school math problems of GSM8K (Cobbe et al., 2021), we follow Ye et al. (2024b) in a finetuning setting using augmented symbolic data to test the CoT (Wei et al., 2022b) math-reasoning abilities of diffusion models. Following Shen et al. (2023), we also test story infilling using ROCStories (Mostafazadeh et al., 2016), evaluated by ROUGE score (Lin, 2004). For code infilling, we adopt the HumanEval single-line infilling task (Bavarian et al., 2022a), evaluated by pass@1 rate. We evaluate DiffuLLaMA's math reasoning and in-context learning abilities on MAWPS (Koncel-Kedziorski et al., 2016), consisting of math word problems, and SAT-Math from AGIEval (Zhong et al., 2024), consisting of math problems from the SAT exam.
Dataset Splits Yes We use the 30 billion token random split from the FineWeb dataset (Penedo et al., 2024)... We randomly sample 65 billion tokens from this mixture and use sequence packing with context length of 2048... In TriviaQA, we set n to the oracle length plus an additional 10 tokens, and we only evaluate the first 2000 cases in this dataset for efficiency. For ROCStories... We evaluate the first 1000 cases in this dataset for efficiency.
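The preprocessing above relies on sequence packing with a fixed context length of 2048. A minimal sketch of that step, assuming tokenized documents are concatenated with an EOS separator and the ragged tail is dropped (a hypothetical helper, not the authors' pipeline):

```python
def pack_sequences(docs, context_length, eos_id):
    """Concatenate tokenized documents into one stream, separated by
    EOS, then slice the stream into fixed-length training chunks."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    n_chunks = len(stream) // context_length
    return [stream[i * context_length:(i + 1) * context_length]
            for i in range(n_chunks)]
```

Packing avoids padding waste: every position in every chunk is a real token, which matters when the training budget is counted in tokens (30B/65B here).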
Hardware Specification Yes We implement DiffuGPT using LLaMA-Factory with DeepSpeed ZeRO-2 parallelization (Rajbhandari et al., 2020) ... where we use 8 A100 80GB GPUs. ... With these settings, we set a batch size of 60 per GPU with context length 2048 on a GH200 96GB GPU. We train our model for 65 billion tokens on 16 nodes with 4 GH200 GPUs each.
Software Dependencies No The paper mentions using "LLaMA-Factory", "DeepSpeed ZeRO-2", Hugging Face Transformers, and FlashAttention-2, but does not provide specific version numbers for these software components.
Experiment Setup Yes We use sequence packing, logits shifting, and 10K-step attention mask annealing to transform GPT2 into DiffuGPT. For both adaptation settings, we employ full-parameter finetuning with bf16. ... We use a learning rate of 3e-4 with a cosine scheduler. The warm-up steps are set to 2K and the attention mask annealing steps to 10K. ... We use AdamW (Loshchilov & Hutter, 2019) to optimize our models with a constant learning rate of 2e-5 and accumulate gradients every 4 steps.
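Attention mask annealing bridges the AR model's causal attention and the diffusion model's bidirectional attention over the first 10K steps. One plausible linear schedule, sketched below, widens the band of visible future positions until the mask is fully bidirectional (the schedule shape is an assumption; the paper only states the 10K-step duration):

```python
import torch

def annealed_attention_mask(seq_len, step, anneal_steps):
    """Hypothetical annealing schedule: start from a strictly causal
    mask and linearly grow the number of visible future positions so
    that at `anneal_steps` every token attends to every other.
    Returns a boolean (seq_len, seq_len) mask; True = may attend."""
    frac = min(step, anneal_steps) / anneal_steps
    lookahead = int(frac * (seq_len - 1))
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool),
                      diagonal=lookahead)
```

At step 0 this reproduces the causal mask the pretrained AR model expects; by the end of annealing it is all-True, matching the full attention used by the diffusion objective.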