What Makes Large Language Models Reason in (Multi-Turn) Code Generation?
Authors: Kunhao Zheng, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thus investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements. After systematically decomposing reasoning, instruction, and execution feedback prompts, we conduct an extensive grid search on the competitive programming benchmarks Code Contests and TACO for multiple LLM families and sizes (Llama 3.0 and 3.1, 8B, 70B, 405B, and GPT-4o). Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets. We then show how finetuning with such an optimal configuration allows models to internalize the induced reasoning process and obtain improvements in performance and scalability for multi-turn code generation. |
| Researcher Affiliation | Collaboration | Kunhao Zheng (1,2), Juliette Decugis (1), Jonas Gehring (1), Taco Cohen (1), Benjamin Negrevergne (2), Gabriel Synnaeve (1). (1) Meta AI (FAIR), (2) Paris Dauphine University PSL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. While it describes methods and processes, these are presented in natural language or as flow diagrams rather than formal pseudocode. |
| Open Source Code | No | We will release the code for our multi-turn and CoT methods to facilitate reproduction. |
| Open Datasets | Yes | The two benchmarks we use: Code Contests (https://github.com/google-deepmind/code_contests) and TACO (https://github.com/FlagOpen/TACO) are publicly available. |
| Dataset Splits | Yes | Code Contests (Li et al., 2022) contains 13k programming problems in the training set and 117/165 problems in the valid/test set. Each problem contains public tests, private tests, and generated tests. We use public tests to provide execution feedback in the multi-turn setting and use all available tests to evaluate the final submission. (2) TACO (Li et al., 2023b) is a collection of problems sourced from Code Contests, APPS (Hendrycks et al., 2021), and various programming contest platforms. The test set is split into 5 distinct difficulty levels: easy, medium, medium-hard, hard, and very-hard, with each level comprising 200 problems. |
| Hardware Specification | Yes | The end-to-end finetuning takes 170 H100 hours with Tensor Parallelism of size 8 and Fully Sharded Data Parallelism (FSDP). |
| Software Dependencies | No | The paper mentions several software components like 'Llama 3.0 and 3.1' (models) and Python libraries like 'ast' and 'difflib' (Appendix B.1) but does not provide specific version numbers for these or other ancillary software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We generate with nucleus sampling (Holtzman et al., 2020, top-p=0.95) and a temperature of 1.0 to encourage output diversity. (...) The finetuning uses learning rate 2e-6, 545 steps of gradient updates, sequence length 8192, global batch size 524288 tokens. We use AdamW as the optimizer with weight decay 0.1, β1 = 0.9 and β2 = 0.95. The learning rate schedule is cosine scheduling with 10 warmup steps annealing to 10% of peak learning rate at the end of the training. |
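The quoted learning-rate schedule (cosine with 10 warmup steps, annealing to 10% of the 2e-6 peak over 545 steps) can be sketched as follows. This is a minimal illustration, not the authors' code; the warmup shape is not specified in the paper, so linear warmup is assumed here.

```python
import math

# Hyperparameters quoted from the paper's finetuning setup.
PEAK_LR = 2e-6
TOTAL_STEPS = 545
WARMUP_STEPS = 10        # linear warmup assumed (shape not specified in the paper)
FINAL_LR_FRACTION = 0.1  # anneal to 10% of peak LR by the end of training

def lr_at_step(step: int) -> float:
    """Learning rate at a given 0-indexed gradient step."""
    if step < WARMUP_STEPS:
        # Linear ramp from PEAK_LR / WARMUP_STEPS up to PEAK_LR.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down toward FINAL_LR_FRACTION * PEAK_LR.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (FINAL_LR_FRACTION + (1.0 - FINAL_LR_FRACTION) * cosine)
```

For example, `lr_at_step(10)` returns the peak learning rate (2e-6), and the rate at the final step is close to 2e-7, i.e. 10% of peak.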