What Makes Large Language Models Reason in (Multi-Turn) Code Generation?
Authors: Kunhao Zheng, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thus investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements. After systematically decomposing reasoning, instruction, and execution feedback prompts, we conduct an extensive grid search on the competitive programming benchmarks Code Contests and TACO for multiple LLM families and sizes (Llama 3.0 and 3.1, 8B, 70B, 405B, and GPT-4o). Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets. We then show how finetuning with such an optimal configuration allows models to internalize the induced reasoning process and obtain improvements in performance and scalability for multi-turn code generation. |
| Researcher Affiliation | Collaboration | Kunhao Zheng (1,2), Juliette Decugis (1), Jonas Gehring (1), Taco Cohen (1), Benjamin Negrevergne (2), Gabriel Synnaeve (1). (1) Meta AI (FAIR), (2) Paris Dauphine University PSL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. While it describes methods and processes, these are presented in natural language or as flow diagrams rather than formal pseudocode. |
| Open Source Code | No | We will release the code for our multi-turn and CoT methods to facilitate reproduction. |
| Open Datasets | Yes | The two benchmarks we use: Code Contests (https://github.com/google-deepmind/code_contests) and TACO (https://github.com/FlagOpen/TACO) are publicly available. |
| Dataset Splits | Yes | Code Contests (Li et al., 2022) contains 13k programming problems in the training set and 117/165 problems in the valid/test set. Each problem contains public tests, private tests, and generated tests. We use public tests to provide execution feedback in the multi-turn setting and use all available tests to evaluate the final submission. (2) TACO (Li et al., 2023b) is a collection of problems sourced from Code Contests, APPS (Hendrycks et al., 2021), and various programming contest platforms. The test set is split into 5 distinct difficulty levels: easy, medium, medium-hard, hard, and very-hard, with each level comprising 200 problems. |
| Hardware Specification | Yes | The end-to-end finetuning takes 170 H100 hours with Tensor Parallelism of size 8 and Fully Sharded Data Parallelism (FSDP). |
| Software Dependencies | No | The paper mentions several software components like 'Llama 3.0 and 3.1' (models) and Python libraries like 'ast' and 'difflib' (Appendix B.1) but does not provide specific version numbers for these or other ancillary software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We generate with nucleus sampling (Holtzman et al., 2020, top-p=0.95) and a temperature of 1.0 to encourage output diversity. (...) The finetuning uses learning rate 2e-6, 545 steps of gradient updates, sequence length 8192, global batch size 524288 tokens. We use AdamW as the optimizer with weight decay 0.1, β1 = 0.9 and β2 = 0.95. The learning rate schedule is cosine scheduling with 10 warmup steps annealing to 10% of peak learning rate at the end of the training. |
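The quoted learning-rate schedule (cosine with 10 warmup steps, annealing to 10% of the 2e-6 peak over 545 steps) can be sketched as follows. This is a minimal illustration, not the authors' code; the warmup shape is not specified in the paper, so linear warmup is assumed here.

```python
import math

# Hyperparameters quoted from the paper's finetuning setup.
PEAK_LR = 2e-6
TOTAL_STEPS = 545
WARMUP_STEPS = 10        # linear warmup assumed (shape not specified in the paper)
FINAL_LR_FRACTION = 0.1  # anneal to 10% of peak LR by the end of training

def lr_at_step(step: int) -> float:
    """Learning rate at a given 0-indexed gradient step."""
    if step < WARMUP_STEPS:
        # Linear ramp from PEAK_LR / WARMUP_STEPS up to PEAK_LR.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down toward FINAL_LR_FRACTION * PEAK_LR.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (FINAL_LR_FRACTION + (1.0 - FINAL_LR_FRACTION) * cosine)
```

For example, `lr_at_step(10)` returns the peak learning rate (2e-6), and the rate at the final step is close to 2e-7, i.e. 10% of peak.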