Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models
Authors: Etrit Haxholli, Yeti Z. Gurbuz, Oğul Can, Eli Waxman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While continuous diffusion models excel in modeling continuous distributions, their application to categorical data has been less effective. Recent work has shown that ratio-matching through score-entropy within a continuous-time discrete Markov chain (CTMC) framework serves as a competitive alternative to autoregressive models in language modeling. To enhance this framework, we first introduce three new theorems concerning the KL divergence between the data and learned distribution. Our results serve as the discrete counterpart to those established for continuous diffusion models and allow us to derive an improved upper bound on the perplexity. Second, we empirically show that ratio-matching performed by minimizing the denoising cross-entropy between the clean and corrupted data enables models to outperform those utilizing score-entropy, with up to 10% lower perplexity/generative-perplexity and 15% faster training steps. |
| Researcher Affiliation | Industry | Etrit Haxholli, Yeti Z. Gürbüz, Oğul Can, Eli Waxman; Meta Dialog Research; EMAIL |
| Pseudocode | Yes | Algorithm 1 (Cross-Entropy Training Algorithm). Require: network f_θ, (total) noise schedule σ_t, data distribution p_data, token transition matrix Q_tok, and time t ∈ [0, 1]. Sample x_0 ∼ p_0, t ∼ U([0, 1]). Construct x_t from x_0; in particular, x_t^i ∼ p_{t\|0}(· \| x_0^i) = exp(σ_t Q_tok)[:, x_0^i]. Compute L_ll = −∑_{i=1}^{L} log f_θ^i(x_t, t)[x_0^i]. Backpropagate ∇_θ L_ll. Run optimizer. |
| Open Source Code | Yes | Code is available at: https://github.com/MetaDialog-Research/PBRC |
| Open Datasets | Yes | We compare the generative perplexities (Gen Perp) of identical networks trained using SEDD, SEDDs, CEDD and CEDD*. To evaluate the generative perplexity of a model, we generate samples from that model and use a GPT-2 large model to assess the likelihood of the generated samples. However, this metric can be unreliable, as models such as GPT-2 large are not perfect themselves and tend to assign high probability to some unlikely sequences, such as those that contain repetitive tokens. Such biased samples can be generated by increasing the step size while maintaining the number of reverse steps; to ensure a fair comparison, we avoid such approaches. Furthermore, Zheng et al. (2024) recently showed that the sampling procedure in Lou et al. (2024) suffers from numerical precision issues. To address this, they proposed performing the categorical sampling in 64-bit floating-point precision, a strategy that we have also adopted. In Appendix B.6, we also provide perplexity results as evaluated by Llama 3.1 8B (Dubey et al., 2024). We now empirically validate the approaches and theoretical contributions presented in the previous section. In Subsection 4.1, we compare the generative perplexities of models trained on Open Web Text (Gokaslan & Cohen, 2019) with SEDD and SEDD scaled (SEDDs, see Appendix B.3.2 for details) versus those trained with CEDD and CEDD*. We keep all other variables unchanged for a fair comparison, finding that CEDD outperforms SEDD in all cases. The tests are conducted for the absorb, uniform and roulette diffusion dynamics. In Subsection 4.2, we evaluate the perplexity of the models by calculating the upper bound on 5 different datasets, namely: 1BW, LAMBADA, PTB, Wikitext2 and Wikitext103 (Chelba et al., 2013; Paperno et al., 2016; Marcus et al., 1993; Merity et al., 2016). |
| Dataset Splits | No | The paper mentions using 'Open Web Text' for training and 'Wiki Text-103' for evaluation. It also lists '1BW, LAMBADA, PTB, Wikitext2 and Wikitext103' for perplexity evaluation. It states: 'We do not shuffle the test set.' and 'The test set is contaminated with mistakes (5% of characters)'. However, specific percentages or counts for training/validation/test splits, or explicit references to predefined standard splits, are not provided. |
| Hardware Specification | Yes | In terms of hyperparameters, the model was trained on a single H100 GPU when the sequence length is set at 128, while for sequence lengths of 1024 the model is trained using eight H100 GPUs, with a vocabulary size of 50,257 tokens. |
| Software Dependencies | No | The paper states: 'Training utilizes the AdamW optimizer with a learning rate of 0.0003, beta parameters of 0.9 and 0.999, and epsilon set to 1e-8'. It also mentions the 'GPT-2 large model' and 'Llama 3.1 8B'. However, it does not specify version numbers for any general software dependencies such as Python, PyTorch, or CUDA, which would be crucial for reproduction. |
| Experiment Setup | Yes | The network is configured with 12 transformer blocks, each featuring 12 attention heads and a hidden size of 768, aligning with the small variant of GPT-2. It includes conditioning dimensions set at 128 to facilitate the diffusion process by encoding time-dependent features. Notably, the architecture excludes masking, typical of generative models that generate all tokens simultaneously rather than sequentially. It uses standard scaled dot-product attention mechanisms and incorporates a dropout rate of 0.1 to mitigate overfitting. In terms of hyperparameters, the model was trained on a single H100 GPU when the sequence length is set at 128, while for sequence lengths of 1024 the model is trained using eight H100 GPUs, with a vocabulary size of 50,257 tokens. Training involves a batch size of 512 and is designed for a total of 400,000 iterations. Training utilizes the Open Web Text dataset, while evaluation is conducted on WikiText-103, with data managed locally to speed up access times. The noise schedule for the diffusion process is log-linear (uniform, absorb) and roulette log-linear (roulette), controlling the variance of noise added incrementally. In both cases we set ϵ = 0.001 as in (Lou et al., 2024). Sampling for evaluation during training employs an Euler predictor over 128 (and 1024 when L = 1024) steps, with noise removal enabled. For optimization, the model uses the AdamW optimizer with a learning rate of 0.0003, beta parameters of 0.9 and 0.999, and epsilon set to 1e-8. It features no weight decay, focusing on adapting learning without additional regularization. The optimizer includes a warm-up phase of 2,500 steps to stabilize learning dynamics, and employs gradient clipping at a threshold of 1 to prevent gradients from exploding during training. |
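The cross-entropy training step quoted in the Pseudocode row can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: it assumes the absorbing-chain case (where exp(σ_t Q_tok)[:, x_0] has a simple closed form: a token survives with probability exp(−σ_t), otherwise it jumps to the mask token), uses a toy vocabulary, and stands in a dummy `f_theta` for the transformer.

```python
import numpy as np

def absorb_transition_probs(x0_token, sigma_t, vocab_size, mask_id):
    """Closed-form column exp(sigma_t * Q_tok)[:, x0] for the absorbing chain:
    the token stays itself with probability exp(-sigma_t), otherwise it has
    been absorbed into the mask token."""
    probs = np.zeros(vocab_size)
    stay = np.exp(-sigma_t)
    probs[x0_token] = stay
    probs[mask_id] += 1.0 - stay
    return probs

def cross_entropy_training_step(x0, sigma_t, vocab_size, mask_id, rng, f_theta):
    """One denoising cross-entropy step: corrupt x0 into x_t, then score
    L_ll = -sum_i log f_theta^i(x_t, t)[x_0^i]."""
    # Corrupt each position independently: x_t^i ~ p_{t|0}(. | x_0^i).
    xt = np.array([
        rng.choice(vocab_size,
                   p=absorb_transition_probs(tok, sigma_t, vocab_size, mask_id))
        for tok in x0
    ])
    # f_theta returns per-position logits of shape (L, vocab_size).
    logits = f_theta(xt)
    # Log-softmax (no max-subtraction here; fine for a small-scale sketch).
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Denoising cross-entropy against the clean tokens x_0.
    loss = -log_probs[np.arange(len(x0)), x0].sum()
    return xt, loss
```

In the paper's setting the gradient of this loss would then be backpropagated through the network and an AdamW step taken; here the sketch only shows the forward corruption and loss computation.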
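The Experiment Setup row specifies a log-linear noise schedule with ϵ = 0.001 following Lou et al. (2024). Assuming the commonly cited log-linear form σ̄(t) = −log(1 − (1 − ϵ)t), under which a token's survival probability exp(−σ̄(t)) decays linearly from 1 at t = 0 to ϵ at t = 1, the schedule can be sketched as:

```python
import math

def loglinear_total_noise(t, eps=1e-3):
    """Assumed log-linear total-noise schedule (form attributed to
    Lou et al., 2024): sigma_bar(t) = -log(1 - (1 - eps) * t).
    Then exp(-sigma_bar(t)) = 1 - (1 - eps) * t, i.e. the probability
    that a token has survived uncorrupted decays linearly in t."""
    return -math.log(1.0 - (1.0 - eps) * t)
```

At t = 0 no noise has been applied, and at t = 1 the survival probability equals ϵ, matching the ϵ = 0.001 setting quoted above.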
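The Open Datasets row describes scoring generated samples with an external language model (GPT-2 large, or Llama 3.1 8B in Appendix B.6). Whatever scorer is used, generative perplexity reduces to the exponential of the mean per-token negative log-likelihood over all generated sequences; a minimal sketch of that final aggregation, taking per-token log-probabilities from any scoring model as input, is:

```python
import math

def generative_perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token),
    where token_log_probs is a list of sequences, each a list of
    log-probabilities the scoring model assigned to the generated tokens."""
    total_nll = -sum(lp for seq in token_log_probs for lp in seq)
    n_tokens = sum(len(seq) for seq in token_log_probs)
    return math.exp(total_nll / n_tokens)
```

The caveat quoted in the table applies to the scorer, not to this formula: a scorer that assigns high probability to degenerate repetitive sequences will yield a misleadingly low perplexity, which is why the paper avoids sampling tricks that exploit this.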