Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models
Authors: Etrit Haxholli, Yeti Z. Gurbuz, Oğul Can, Eli Waxman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While continuous diffusion models excel in modeling continuous distributions, their application to categorical data has been less effective. Recent work has shown that ratio-matching through score-entropy within a continuous-time discrete Markov chain (CTMC) framework serves as a competitive alternative to autoregressive models in language modeling. To enhance this framework, we first introduce three new theorems concerning the KL divergence between the data and learned distribution. Our results serve as the discrete counterpart to those established for continuous diffusion models and allow us to derive an improved upper bound on the perplexity. Second, we empirically show that ratio-matching performed by minimizing the denoising cross-entropy between the clean and corrupted data enables models to outperform those utilizing score-entropy, with up to 10% lower perplexity/generative-perplexity and 15% faster training steps. |
| Researcher Affiliation | Industry | Etrit Haxholli, Yeti Z. Gürbüz, Oğul Can, Eli Waxman; Meta Dialog Research; EMAIL |
| Pseudocode | Yes | Algorithm 1 (Cross-Entropy Training Algorithm). Require: network f_θ, (total) noise schedule σ_t, data distribution p_data, token transition matrix Q_tok, and time t ∈ [0, 1]. Sample x_0 ∼ p_0, t ∼ U([0, 1]). Construct x_t from x_0; in particular, x_t^i ∼ p_{t\|0}(· \| x_0^i) = exp(σ_t Q_tok)[:, x_0^i]. Compute L_ll = −∑_{i=1}^{L} log f_θ^i(x_t, t)[x_0^i]. Backpropagate ∇_θ L_ll. Run optimizer. |
| Open Source Code | Yes | Code is available at: https://github.com/MetaDialog-Research/PBRC |
| Open Datasets | Yes | We compare the generative perplexities (Gen Perp) of identical networks trained using SEDD, SEDDs, CEDD and CEDD*. To evaluate the generative perplexity of a model, we generate samples from that model and use a GPT-2 large model to assess the likelihood of the generated samples. However, this metric can be unreliable, as models such as GPT-2 large are not perfect themselves and tend to assign high probability to some unlikely sequences, such as those that contain repetitive tokens. Such biased samples can be generated by increasing the step size while maintaining the number of reverse steps; to ensure a fair comparison, we avoid such approaches. Furthermore, Zheng et al. (2024) recently showed that the sampling procedure in Lou et al. (2024) suffers from numerical precision issues. To address this, they proposed performing the categorical sampling in 64-bit floating-point precision, a strategy that we have also adopted. In Appendix B.6, we also provide perplexity results as evaluated by Llama 3.1 8B (Dubey et al., 2024). We now empirically validate the approaches and theoretical contributions presented in the previous section. In Subsection 4.1, we compare the generative perplexities of models trained on Open Web Text (Gokaslan & Cohen, 2019) with SEDD and SEDD scaled (SEDDs, see Appendix B.3.2 for details) versus those trained with CEDD and CEDD*. We keep all other variables unchanged for a fair comparison, finding that CEDD outperforms SEDD in all cases. The tests are conducted for the absorb, uniform and roulette diffusion dynamics. In Subsection 4.2, we evaluate the perplexity of the models by calculating the upper bound on 5 different datasets, namely: 1BW, LAMBADA, PTB, Wikitext2 and Wikitext103 (Chelba et al., 2013; Paperno et al., 2016; Marcus et al., 1993; Merity et al., 2016). |
| Dataset Splits | No | The paper mentions using 'Open Web Text' for training and 'Wiki Text-103' for evaluation. It also lists '1BW, LAMBADA, PTB, Wikitext2 and Wikitext103' for perplexity evaluation. It states: 'We do not shuffle the test set.' and 'The test set is contaminated with mistakes (5% of characters)'. However, specific percentages or counts for training/validation/test splits, or explicit references to predefined standard splits, are not provided. |
| Hardware Specification | Yes | In terms of hyperparameters, the model was trained on a single H100 GPU when the sequence length is set at 128, while for sequence lengths of 1024 the model is trained using eight H100 GPUs, with a vocabulary size of 50,257 tokens. |
| Software Dependencies | No | The paper states: 'Training utilizes the AdamW optimizer with a learning rate of 0.0003, beta parameters of 0.9 and 0.999, and epsilon set to 1e-8'. It also mentions the 'GPT-2 large model' and 'Llama 3.1 8B'. However, it does not specify version numbers for any general software dependencies such as Python, PyTorch, or CUDA, which would be crucial for reproduction. |
| Experiment Setup | Yes | The network is configured with 12 transformer blocks, each featuring 12 attention heads and a hidden size of 768, aligning with the small variant of GPT-2. It includes conditioning dimensions set at 128 to facilitate the diffusion process by encoding time-dependent features. Notably, the architecture excludes masking, typical of generative models that generate all tokens simultaneously rather than sequentially. It uses standard scaled dot-product attention mechanisms and incorporates a dropout rate of 0.1 to mitigate overfitting. In terms of hyperparameters, the model was trained on a single H100 GPU when the sequence length is set at 128, while for sequence lengths of 1024 the model is trained using eight H100 GPUs, with a vocabulary size of 50,257 tokens. Training involves a batch size of 512 and is designed for a total of 400,000 iterations. Training utilizes the Open Web Text dataset, while evaluation is conducted on WikiText-103, with data managed locally to speed up access times. The noise schedule for the diffusion process is log-linear (uniform, absorb) and roulette log-linear (roulette), controlling the variance of noise added incrementally. In both cases we set ϵ = 0.001 as in (Lou et al., 2024). Sampling for evaluation during training employs an Euler predictor over 128 (and 1024 when L = 1024) steps, with noise removal enabled. For optimization, the model uses the AdamW optimizer with a learning rate of 0.0003, beta parameters of 0.9 and 0.999, and epsilon set to 1e-8. It features no weight decay, focusing on adapting learning without additional regularization. The optimizer includes a warm-up phase of 2,500 steps to stabilize learning dynamics, and employs gradient clipping at a threshold of 1 to prevent gradients from exploding during training. |
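The cross-entropy training step quoted in the Pseudocode row can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: it assumes the absorbing-chain case (where exp(σ_t Q_tok)[:, x_0] has a simple closed form: a token survives with probability exp(−σ_t), otherwise it jumps to the mask token), uses a toy vocabulary, and stands in a dummy `f_theta` for the transformer.

```python
import numpy as np

def absorb_transition_probs(x0_token, sigma_t, vocab_size, mask_id):
    """Closed-form column exp(sigma_t * Q_tok)[:, x0] for the absorbing chain:
    the token stays itself with probability exp(-sigma_t), otherwise it has
    been absorbed into the mask token."""
    probs = np.zeros(vocab_size)
    stay = np.exp(-sigma_t)
    probs[x0_token] = stay
    probs[mask_id] += 1.0 - stay
    return probs

def cross_entropy_training_step(x0, sigma_t, vocab_size, mask_id, rng, f_theta):
    """One denoising cross-entropy step: corrupt x0 into x_t, then score
    L_ll = -sum_i log f_theta^i(x_t, t)[x_0^i]."""
    # Corrupt each position independently: x_t^i ~ p_{t|0}(. | x_0^i).
    xt = np.array([
        rng.choice(vocab_size,
                   p=absorb_transition_probs(tok, sigma_t, vocab_size, mask_id))
        for tok in x0
    ])
    # f_theta returns per-position logits of shape (L, vocab_size).
    logits = f_theta(xt)
    # Log-softmax (no max-subtraction here; fine for a small-scale sketch).
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Denoising cross-entropy against the clean tokens x_0.
    loss = -log_probs[np.arange(len(x0)), x0].sum()
    return xt, loss
```

In the paper's setting the gradient of this loss would then be backpropagated through the network and an AdamW step taken; here the sketch only shows the forward corruption and loss computation.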
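The Experiment Setup row specifies a log-linear noise schedule with ϵ = 0.001 following Lou et al. (2024). Assuming the commonly cited log-linear form σ̄(t) = −log(1 − (1 − ϵ)t), under which a token's survival probability exp(−σ̄(t)) decays linearly from 1 at t = 0 to ϵ at t = 1, the schedule can be sketched as:

```python
import math

def loglinear_total_noise(t, eps=1e-3):
    """Assumed log-linear total-noise schedule (form attributed to
    Lou et al., 2024): sigma_bar(t) = -log(1 - (1 - eps) * t).
    Then exp(-sigma_bar(t)) = 1 - (1 - eps) * t, i.e. the probability
    that a token has survived uncorrupted decays linearly in t."""
    return -math.log(1.0 - (1.0 - eps) * t)
```

At t = 0 no noise has been applied, and at t = 1 the survival probability equals ϵ, matching the ϵ = 0.001 setting quoted above.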
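The Open Datasets row describes scoring generated samples with an external language model (GPT-2 large, or Llama 3.1 8B in Appendix B.6). Whatever scorer is used, generative perplexity reduces to the exponential of the mean per-token negative log-likelihood over all generated sequences; a minimal sketch of that final aggregation, taking per-token log-probabilities from any scoring model as input, is:

```python
import math

def generative_perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token),
    where token_log_probs is a list of sequences, each a list of
    log-probabilities the scoring model assigned to the generated tokens."""
    total_nll = -sum(lp for seq in token_log_probs for lp in seq)
    n_tokens = sum(len(seq) for seq in token_log_probs)
    return math.exp(total_nll / n_tokens)
```

The caveat quoted in the table applies to the scorer, not to this formula: a scorer that assigns high probability to degenerate repetitive sequences will yield a misleadingly low perplexity, which is why the paper avoids sampling tricks that exploit this.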