Energy-Based Diffusion Language Models for Text Generation

Authors: Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, Arash Vahdat

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Comprehensive experiments on language modeling benchmarks show that our model can consistently outperform state-of-the-art diffusion models by a significant margin, and approaches autoregressive models' perplexity." "We conduct comprehensive experiments on two common language modeling benchmarks to evaluate the performance of our proposed method. Results show that on the perplexity metric, EDLM can consistently achieve state-of-the-art performance among diffusion-based counterparts, and approaches or matches AR models. We provide the experimental setup in Section 5.1, describe our results in Section 5.2, and provide additional ablation studies in Section 5.3."
Researcher Affiliation | Collaboration | Minkai Xu1, Tomas Geffner2, Karsten Kreis2, Weili Nie2, Yilun Xu2, Jure Leskovec1, Stefano Ermon1, Arash Vahdat2 (1Stanford University, 2NVIDIA)
Pseudocode | Yes | "A detailed pseudo-code for the NCE training process is provided in Algorithm 2. Detailed pseudo-code for the sampling procedure is provided in Algorithm 1."
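The paper's Algorithm 2 is not reproduced in this report. As a rough illustration of the generic noise-contrastive-estimation objective such training builds on (not the authors' exact procedure), an energy function can be fit as a binary classifier whose "real" logit is the negative energy, with data sequences as positives and diffusion-model samples as negatives:

```python
import math

def nce_loss(energy_data: float, energy_sample: float) -> float:
    """Per-pair NCE loss. The classifier logit for 'real' is -E(x), so the
    loss shrinks when energy is low on data and high on model samples."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    # -log P(real | x_data) - log P(fake | x_sample)
    return -(math.log(sigmoid(-energy_data)) + math.log(sigmoid(energy_sample)))

# Uninformative energies give the chance-level loss of 2*ln(2);
# well-separated energies give a smaller loss.
print(nce_loss(0.0, 0.0))
print(nce_loss(-3.0, 3.0))
```

Here `energy_data` and `energy_sample` stand in for the scalar outputs of the energy network on a real and a generated sequence; the batching and the source of negatives are where the paper's Algorithm 2 differs from this toy sketch.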
Open Source Code | Yes | Code is available at https://github.com/MinkaiXu/Energy-Diffusion-LLM.
Open Datasets | Yes | "We use two text datasets: 1) Text8 (Mahoney, 2006), a relatively small-scale, character-level text modeling benchmark extracted from English Wikipedia, and 2) OpenWebText (Gokaslan & Cohen, 2019), an open-source replica of the unreleased WebText dataset used to train GPT-2." Zero-shot evaluation additionally uses Penn Tree Bank (PTB) (Marcus et al., 1993), Wikitext (Merity et al., 2016), LM1B, Lambada (Paperno et al., 2016), AG News (Zhang et al., 2015), and Scientific Papers (PubMed and ArXiv) (Cohan et al., 2018).
Dataset Splits | Yes | "We follow all the common practices in Austin et al. (2021); Campbell et al. (2024) to conduct Text8 experiments... We follow the standard dataset split and train MDLM using a standard 12-layer transformer architecture. We follow the standard data split in Sahoo et al. (2024), holding out the last 100k documents as the validation set."
Hardware Specification | No | "The NCE finetuning process can be done on 4 GPUs in less than 4 hours."
Software Dependencies | No | "For all models, including our method and baselines, we follow the common practice of using standard 12-layer transformers similar to the GPT2-small scale (Radford et al., 2019; Shi et al., 2024). We tokenize OpenWebText with the GPT-2 tokenizer, with a vocabulary size of around 50K."
Experiment Setup | Yes | "For all models, including our method and baselines, we follow the common practice of using standard 12-layer transformers similar to the GPT2-small scale (Radford et al., 2019; Shi et al., 2024). Our proposed EDLM combines two models, the diffusion model pθ and the energy function Eϕ. For all experiments, we use pretrained MDLM (Sahoo et al., 2024) as the diffusion model pθ. For Text8... the transformer also has the same number of heads (12) and hidden dimension (768)... The model is trained on text chunks of length 256 for 1 million steps with batch size 512. For both MDLM training and EDLM-NCE finetuning, we follow previous work in using a cosine learning rate schedule with a linear warm-up of 2000 steps. We set the channel-wise dropout rate to 0.05 and optimized with AdamW at a learning rate of 0.0003. We similarly adopt a weight decay factor of 0.03. For OpenWebText... All models are trained on sequences wrapped to a length of 1,024... All architectural choices are kept the same as in the Text8 experiment: transformers with 12 layers, a hidden dimension of 768, 12 attention heads, and a timestep embedding of 128 when applicable. Word embeddings are not tied between the input and output. Other training details are also kept the same, i.e., we use the AdamW optimizer with a batch size of 512 and a learning rate of 0.0003 with a linear warm-up of 2500 steps. We train all models for 1M steps with the dropout rate reduced to 0.1."