Multi-Turn Code Generation Through Single-Step Rewards

Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide an analysis of the design choices of the reward models and policy, and show the efficacy of µCODE at utilizing execution feedback.
Researcher Affiliation Academia 1Mila Quebec AI Institute 2Université de Montréal 3Cornell University. Correspondence to: Arnav <EMAIL>, Gonzalo <EMAIL>.
Pseudocode Yes Algorithm 1 µCODE: Training. Input: initial generator π_0, multi-turn code environment E, and max iterations M. 1: for iteration i = 1...M do 2: Roll out generator π_θ in multi-turn environment E to collect datapoints D_i ← {(x, s_t, y_t, o_t)} 3: Aggregate data D ← D ∪ D_i 4: Train a verifier R^i_ϕ(x, y) on D 5: Construct a local-search expert using the verifier: π*_i(x) = arg max_{y ∈ D(x)} β_O R(x, y) + β_L R_ϕ(x, y) 6: Relabel data D with π*_i(x) to get D*_i 7: Train π^i_θ with fine-tuning (FT) on D*_i 8: end for. Output: best generator π_θ and verifier R_ϕ.
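The training loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `generate`, `execute`, `train_verifier`, and `fine_tune` are hypothetical stand-ins for the generator rollout, execution-feedback oracle, verifier training, and fine-tuning steps, and single-turn rollouts stand in for the multi-turn environment.

```python
def mu_code_train(generate, execute, train_verifier, fine_tune, problems,
                  iterations=2, beta_o=1.0, beta_l=1.0):
    """Sketch of Algorithm 1 (muCODE training). All callables are
    hypothetical stand-ins; beta_o / beta_l weight the oracle reward
    and the learned verifier score in the local-search expert."""
    dataset = []  # aggregated (problem, candidate, oracle reward) triples
    verifier = None
    for _ in range(iterations):
        # Step 2-3: roll out the current generator and aggregate data.
        rollouts = [(x, y, execute(x, y)) for x in problems for y in generate(x)]
        dataset.extend(rollouts)
        # Step 4: fit a verifier R_phi on all data collected so far.
        verifier = train_verifier(dataset)
        # Step 5-6: local-search expert relabels each problem with the
        # candidate maximizing beta_o * oracle + beta_l * verifier score.
        relabeled = []
        for x in problems:
            candidates = [(y, r) for (px, y, r) in dataset if px == x]
            best, _ = max(candidates,
                          key=lambda c: beta_o * c[1] + beta_l * verifier(x, c[0]))
            relabeled.append((x, best))
        # Step 7: fine-tune the generator on the expert-relabeled data.
        generate = fine_tune(relabeled)
    return generate, verifier
```

With stub components, one iteration relabels each problem with its best-scoring candidate and fine-tunes on that set, mirroring lines 2-7 of the pseudocode.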
Open Source Code Yes Our code is available here.
Open Datasets Yes We conduct experiments on MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021), where the agent needs to generate code solutions in Python given natural language descriptions. We train the methods on the MBPP training set, which comprises 374 problems, and evaluate on the MBPP test set and the HumanEval (HE) dataset, which have 500 and 164 problems, respectively. We also compare methods on the DeepMind Code Contests dataset (CC; Li et al., 2022a), where we train on 1000 problems sampled from the training set and evaluate on the 165 problems in the test set.
Dataset Splits Yes We train the methods on the MBPP training set, which comprises 374 problems, and evaluate on the MBPP test set and the HumanEval (HE) dataset, which have 500 and 164 problems, respectively. We also compare methods on the DeepMind Code Contests dataset (CC; Li et al., 2022a), where we train on 1000 problems sampled from the training set and evaluate on the 165 problems in the test set. We further describe the prompts and the split of public and private tests in Appendix C.1 and C.2. For HumanEval, we use a single test from the code prompt's docstring as the public test and the remaining tests along with the official test suite as private tests. For MBPP, we use a single test from the official test suite as the public test, and the remaining tests and any challenge test list tests as private tests.
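The described MBPP split (first official test public, remaining tests plus any challenge tests private) can be expressed as a short helper. This is a hypothetical sketch of the splitting rule, not code from the paper; `split_tests` and its parameters are names chosen here for illustration.

```python
def split_tests(official_tests, challenge_tests=()):
    """Sketch of the MBPP public/private split described in the review:
    the first official test is public; everything else, including any
    challenge-list tests, is held out as private tests."""
    public = official_tests[:1]
    private = official_tests[1:] + list(challenge_tests)
    return public, private
```

The HumanEval split follows the same pattern, except the single public test is drawn from the prompt's docstring and the official test suite joins the private side.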
Hardware Specification Yes All training runs were on machines with either 4 RTX 6000 Ada Generation GPUs for 1B models with 48 GB of memory per GPU or 4 H100 GPUs for 8B models with 80 GB of memory per GPU.
Software Dependencies No The paper mentions 'SGLang (Zheng et al., 2024)' and 'Llama-3.2-1B-Instruct or Llama-3.1-8B-Instruct (Dubey et al., 2024)'. While SGLang is software, no version number is provided for it, and the Llama models are specific models rather than general software components with version numbers (such as Python or PyTorch), as requested by the instructions.
Experiment Setup Yes Table 6 contains hyperparameters for training the generator and reward model on both models (Llama-3.1-8B-Instruct and Llama-3.2-1B-Instruct) and datasets (MBPP and HumanEval). We perform 2 iterations of training with µCODE, starting from the base model at each iteration. Training epochs: 2 / 2; learning rate: 5×10^-7 / 1×10^-6; batch size: 32 / 64; max sequence length: 4096 / 2048.
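The reported hyperparameters can be collected into a small config fragment. This is an assumption-laden sketch: the pairing of each value pair with a particular model size or dataset is not stated in the excerpt, so both reported values are kept side by side rather than assigned.

```python
# Hypothetical config mirroring the Table 6 values quoted above.
# Each tuple holds the two reported settings; which column maps to
# which model/dataset is NOT specified in the excerpt.
MUCODE_TRAIN_CONFIG = {
    "iterations": 2,                    # muCODE iterations, restarting from the base model
    "epochs": (2, 2),
    "learning_rate": (5e-7, 1e-6),
    "batch_size": (32, 64),
    "max_seq_length": (4096, 2048),
}
```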