Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
Authors: Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, Lingpeng Kong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MGDM significantly outperforms autoregressive models without using search techniques. For instance, MGDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks. All associated codes are available at https://github.com/HKUNLP/diffusion-vs-ar. |
| Researcher Affiliation | Collaboration | Jiacheng Ye1, Jiahui Gao2, Shansan Gong1, Lin Zheng1, Xin Jiang2, Zhenguo Li2, Lingpeng Kong1 — 1 The University of Hong Kong, 2 Huawei Noah's Ark Lab |
| Pseudocode | Yes | The detailed algorithms for training and inference are illustrated in Algorithm 1 and 2, respectively. |
| Open Source Code | Yes | All associated codes are available at https://github.com/HKUNLP/diffusion-vs-ar. |
| Open Datasets | Yes | On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems... We follow Gandhi et al. (2024) to generate 500k problems... We collect one million solved games from Park (2016)... we take advantage of the well-studied family of random k-SAT problem (Ding et al., 2015). |
| Dataset Splits | Yes | We generate 500k problems with target numbers ranging from 10 to 100 and randomly hold out 10% of the targets for out-of-distribution evaluation. ... We collect one million solved games from Park (2016) and use the first 100k as our training set and the subsequent 1k as the testing set. ... generate 50k training data for n = 5, 7 and 100k for n = 9, as well as additional 1k testing data for each n. |
| Hardware Specification | Yes | We conduct all the experiments on NVIDIA V100-32G GPUs, and we use 8 GPUs for training and sampling. |
| Software Dependencies | No | No specific versions for key software components like Python, PyTorch, or CUDA are mentioned. The paper refers to architectures like 'GPT-2 architecture' and models like 'LLaMA', but not with specific software dependency versions for replication. |
| Experiment Setup | Yes | We set the learning rate to 1e-3 for the tiny model and 3e-4 for others, and we set the batch size to 1024 across all the models and tasks. We train MGDM for 1200 epochs on the minimal planning task, 300 epochs on Sudoku, and 600 epochs on other datasets. By default, we set the diffusion sampling steps to T = 20 for tasks with average output tokens larger than 20, otherwise T = 10. We use a decoding temperature τ = 0.5 for all tasks. |
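To make the inference-side settings above concrete, here is a minimal sketch of the general any-order masked-diffusion decoding scheme that the Pseudocode row refers to (Algorithm 2), using the reported sampling steps T and decoding temperature τ = 0.5. This is an illustration of the technique, not a verbatim reimplementation of MGDM: the function names `sample_masked_diffusion` and `toy_model` are hypothetical, and the confidence-based reveal schedule is one common choice for masked diffusion, assumed here rather than taken from the paper.

```python
import math

MASK = -1  # sentinel for a still-masked position


def sample_masked_diffusion(model, seq_len, T=20, temperature=0.5):
    """Sketch of confidence-based masked-diffusion decoding.

    `model(x)` is assumed to return one dict of token -> unnormalized
    score per position. Decoding starts fully masked and reveals the
    most confident positions over T steps.
    """
    x = [MASK] * seq_len
    for step in range(T):
        masked = [i for i, t in enumerate(x) if t == MASK]
        if not masked:
            break
        scores = model(x)
        best = {}
        for i in masked:
            logits = scores[i]
            m = max(logits.values())
            # Temperature-scaled softmax; keep the argmax token and its
            # probability as this position's confidence.
            exps = {t: math.exp((s - m) / temperature) for t, s in logits.items()}
            z = sum(exps.values())
            tok = max(exps, key=exps.get)
            best[i] = (exps[tok] / z, tok)
        # Reveal enough positions each step so all are filled by step T.
        n_reveal = max(1, math.ceil(len(masked) / (T - step)))
        for i in sorted(masked, key=lambda i: best[i][0], reverse=True)[:n_reveal]:
            x[i] = best[i][1]
    return x


def toy_model(x):
    # Hypothetical stand-in network: position i strongly favors token i % 3.
    return [{t: (3.0 if t == i % 3 else 0.0) for t in range(3)}
            for i in range(len(x))]


out = sample_masked_diffusion(toy_model, seq_len=12, T=10)
```

The loop mirrors the paper's default of T = 20 steps for outputs longer than 20 tokens (T = 10 otherwise); lowering the temperature below 1 sharpens the per-position distribution, which is consistent with the reported τ = 0.5.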