Multi-Turn Code Generation Through Single-Step Rewards
Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of µCODE at utilizing the execution feedback. |
| Researcher Affiliation | Academia | 1Mila Quebec AI Institute 2Université de Montréal 3Cornell University. Correspondence to: Arnav <EMAIL>, Gonzalo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 µCODE: Training. **Input:** initial generator π₀, multi-turn code environment E, and max iterations M. 1: for iteration i = 1…M do; 2: roll out generator π_θ in multi-turn environment E to collect datapoints Dᵢ ← {(x, sₜ, yₜ, oₜ)}; 3: aggregate data D ← D ∪ Dᵢ; 4: train a verifier Rᵢ_ϕ(x, y) on D; 5: construct a local search expert using the verifier: πⁱ(x) = argmax_{y ∈ D(x)} β_O R(x, y) + β_L R_ϕ(x, y); 6: relabel data D with πⁱ(x) to get Dⁱ; 7: train πⁱ_θ with fine-tuning (FT) on Dⁱ; 8: end for. **Output:** best generator π_θ and verifier R_ϕ |
| Open Source Code | Yes | Our code is available here. |
| Open Datasets | Yes | We conduct experiments on MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021), where the agent needs to generate code solutions in Python given natural language descriptions. We train the methods on the MBPP training set, which comprises 374 problems, and evaluate on the MBPP test set and HumanEval (HE) dataset, which contain 500 and 164 problems, respectively. We also compare methods on the DeepMind Code Contests dataset (CC, Li et al. (2022a)), where we train on 1000 problems sampled from the training set and evaluate on the 165 problems in the test set. |
| Dataset Splits | Yes | We train the methods on the MBPP training set, which comprises 374 problems, and evaluate on the MBPP test set and HumanEval (HE) dataset, which contain 500 and 164 problems, respectively. We also compare methods on the DeepMind Code Contests dataset (CC, Li et al. (2022a)), where we train on 1000 problems sampled from the training set and evaluate on the 165 problems in the test set. We further describe the prompts and the split of public and private tests in Appendices C.1 and C.2. For HumanEval, we use a single test from the code prompt's docstring as the public test and the remaining tests along with the official test suite as private tests. For MBPP, we use a single test from the official test suite as the public test, and the remaining tests and any challenge test list tests as private tests. |
| Hardware Specification | Yes | All training runs were on machines with either 4 RTX 6000 Ada Generation GPUs for 1B models with 48 GB of memory per GPU or 4 H100 GPUs for 8B models with 80 GB of memory per GPU. |
| Software Dependencies | No | The paper mentions 'SGLang (Zheng et al., 2024)' and 'Llama-3.2-1B-Instruct or Llama-3.1-8B-Instruct (Dubey et al., 2024)'. While SGLang is software, no specific version number is provided for it. The Llama models are specific models, not general software components like Python or PyTorch with version numbers, as requested by the instructions. |
| Experiment Setup | Yes | Table 6 contains hyperparameters for training the generator and reward model on both models (Llama-3.1-8B-Instruct and Llama-3.2-1B-Instruct) and datasets (MBPP and HumanEval). We perform 2 iterations of training with µCODE, starting from the base model each iteration. Training Epochs: 2 / 2; Learning Rate: 5×10⁻⁷ / 1×10⁻⁶; Batch Size: 32 / 64; Max seq length: 4096 / 2048. |
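The training loop quoted in the Pseudocode row can be sketched in Python. This is a minimal illustration of Algorithm 1's structure only: the environment, `rollout`, `train_verifier`, and `finetune` callables, and the `beta_o`/`beta_l` mixing weights are all hypothetical placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of Algorithm 1 (the µCODE training loop).
# All names here are illustrative stand-ins for the paper's components.

def rollout(generator, env, problems):
    """Collect (problem x, state s, candidate y, execution outcome o) tuples."""
    data = []
    for x in problems:
        state = env.reset(x)
        y = generator(x, state)
        o = env.execute(x, y)  # execution feedback on public tests (Alg. 1, line 2)
        data.append((x, state, y, o))
    return data

def local_search_expert(dataset, verifier, beta_o, beta_l):
    """Relabel each problem with the best observed candidate under a weighted
    mix of execution reward and learned verifier score (Alg. 1, lines 5-6)."""
    by_problem = {}
    for x, _, y, o in dataset:
        by_problem.setdefault(x, []).append((y, o))
    relabeled = []
    for x, candidates in by_problem.items():
        best_y, _ = max(
            candidates,
            key=lambda c: beta_o * c[1] + beta_l * verifier(x, c[0]),
        )
        relabeled.append((x, best_y))
    return relabeled

def mu_code_train(generator, env, problems, train_verifier, finetune,
                  M=2, beta_o=1.0, beta_l=1.0):
    """Iterate rollout -> verifier training -> relabeling -> fine-tuning."""
    D = []
    verifier = None
    for _ in range(M):                                   # line 1
        D += rollout(generator, env, problems)           # lines 2-3 (aggregate)
        verifier = train_verifier(D)                     # line 4
        D_i = local_search_expert(D, verifier, beta_o, beta_l)  # lines 5-6
        generator = finetune(generator, D_i)             # line 7
    return generator, verifier                           # output
```

The paper performs 2 iterations (M = 2), restarting fine-tuning from the base model each iteration rather than continuing from the previous checkpoint.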
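The public/private split described in the Dataset Splits row (one public test for execution feedback, the rest held out for evaluation) can be sketched as a small helper. The function name and signature are hypothetical; `challenge_test_list` mirrors the MBPP field the row refers to.

```python
# Hypothetical sketch of the MBPP-style test split from the Dataset Splits row:
# the first official test is public (visible to the agent as execution
# feedback); remaining tests plus any challenge tests are private.
def split_tests(test_list, challenge_test_list=()):
    public = test_list[:1]
    private = test_list[1:] + list(challenge_test_list)
    return public, private
```

For HumanEval the paper instead takes the single docstring test as public and uses the remaining tests plus the official suite as private, but the one-public/rest-private shape is the same.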