Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
Authors: Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, Lingpeng Kong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MGDM significantly outperforms autoregressive models without using search techniques. For instance, MGDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks. All associated codes are available at https://github.com/HKUNLP/diffusion-vs-ar. |
| Researcher Affiliation | Collaboration | Jiacheng Ye1, Jiahui Gao2, Shansan Gong1, Lin Zheng1, Xin Jiang2, Zhenguo Li2, Lingpeng Kong1 — 1 The University of Hong Kong, 2 Huawei Noah's Ark Lab |
| Pseudocode | Yes | The detailed algorithms for training and inference are illustrated in Algorithm 1 and 2, respectively. |
| Open Source Code | Yes | All associated codes are available at https://github.com/HKUNLP/diffusion-vs-ar. |
| Open Datasets | Yes | On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems... We follow Gandhi et al. (2024) to generate 500k problems... We collect one million solved games from Park (2016)... we take advantage of the well-studied family of random k-SAT problem (Ding et al., 2015). |
| Dataset Splits | Yes | We generate 500k problems with target numbers ranging from 10 to 100 and randomly hold out 10% of the targets for out-of-distribution evaluation. ... We collect one million solved games from Park (2016) and use the first 100k as our training set and the subsequent 1k as the testing set. ... generate 50k training data for n = 5, 7 and 100k for n = 9, as well as additional 1k testing data for each n. |
| Hardware Specification | Yes | We conduct all the experiments on NVIDIA V100-32G GPUs, and we use 8 GPUs for training and sampling. |
| Software Dependencies | No | No specific versions for key software components like Python, PyTorch, or CUDA are mentioned. The paper refers to architectures like 'GPT-2 architecture' and models like 'LLaMA', but not with specific software dependency versions for replication. |
| Experiment Setup | Yes | We set the learning rate to 1e-3 for the tiny model and 3e-4 for others, and we set the batch size to 1024 across all the models and tasks. We train MGDM for 1200 epochs on the minimal planning task, 300 epochs on Sudoku, and 600 epochs on other datasets. By default, we set the diffusion sampling steps to T = 20 for tasks with average output tokens larger than 20, otherwise T = 10. We use a decoding temperature τ = 0.5 for all tasks. |
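To make the inference-side settings above concrete, here is a minimal sketch of the general any-order masked-diffusion decoding scheme that the Pseudocode row refers to (Algorithm 2), using the reported sampling steps T and decoding temperature τ = 0.5. This is an illustration of the technique, not a verbatim reimplementation of MGDM: the function names `sample_masked_diffusion` and `toy_model` are hypothetical, and the confidence-based reveal schedule is one common choice for masked diffusion, assumed here rather than taken from the paper.

```python
import math

MASK = -1  # sentinel for a still-masked position


def sample_masked_diffusion(model, seq_len, T=20, temperature=0.5):
    """Sketch of confidence-based masked-diffusion decoding.

    `model(x)` is assumed to return one dict of token -> unnormalized
    score per position. Decoding starts fully masked and reveals the
    most confident positions over T steps.
    """
    x = [MASK] * seq_len
    for step in range(T):
        masked = [i for i, t in enumerate(x) if t == MASK]
        if not masked:
            break
        scores = model(x)
        best = {}
        for i in masked:
            logits = scores[i]
            m = max(logits.values())
            # Temperature-scaled softmax; keep the argmax token and its
            # probability as this position's confidence.
            exps = {t: math.exp((s - m) / temperature) for t, s in logits.items()}
            z = sum(exps.values())
            tok = max(exps, key=exps.get)
            best[i] = (exps[tok] / z, tok)
        # Reveal enough positions each step so all are filled by step T.
        n_reveal = max(1, math.ceil(len(masked) / (T - step)))
        for i in sorted(masked, key=lambda i: best[i][0], reverse=True)[:n_reveal]:
            x[i] = best[i][1]
    return x


def toy_model(x):
    # Hypothetical stand-in network: position i strongly favors token i % 3.
    return [{t: (3.0 if t == i % 3 else 0.0) for t in range(3)}
            for i in range(len(x))]


out = sample_masked_diffusion(toy_model, seq_len=12, T=10)
```

The loop mirrors the paper's default of T = 20 steps for outputs longer than 20 tokens (T = 10 otherwise); lowering the temperature below 1 sharpens the per-position distribution, which is consistent with the reported τ = 0.5.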