RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Authors: Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark on competitive programming tasks and achieve large performance gains with both small (8B parameters) and large (70B) models, outperforming previous work while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps. |
| Researcher Affiliation | Industry | Meta FAIR. Correspondence to: Jonas Gehring <EMAIL>, Gabriel Synnaeve <EMAIL>. |
| Pseudocode | No | The paper describes the reinforcement learning process and reward function mathematically in Section 2.2 and uses code examples in figures and appendices, but it does not contain a dedicated pseudocode block or algorithm section for the RLEF method itself. |
| Open Source Code | No | The text mentions using a third-party code-base for evaluation: "We evaluate candidate solutions with the accompanying code-base of Li et al. (2022)3 using Python 3.10. 3https://github.com/google-deepmind/code_contests". However, it contains no statement or link indicating that the RLEF training code itself is open-sourced. |
| Open Datasets | Yes | We perform experiments on the Code Contests benchmark (Li et al., 2022), which requires generating a code solution to a problem specified in natural language along with a textual description of public test cases... Our improvements from RLEF on Code Contests further generalize to HumanEval+ and MBPP+, two popular benchmarks for code synthesis |
| Dataset Splits | Yes | Code Contests consists of a training set and two evaluation sets, valid and test, with 117 and 165 problems, respectively; we use the former for model and hyperparameter selection. We optimize our models on the training set, from which we discard 669 of the 13,328 problems due to missing public or private test cases. |
| Hardware Specification | Yes | We train our models on NVidia H100 GPUs; a training run takes approx. 20 wall time hours. With the above parameters we use 288 (128 for training, 160 for inference) and 2304 (1024 for training, 1280 for inference) GPUs for 8B and 70B models, respectively. |
| Software Dependencies | Yes | We evaluate candidate solutions with the accompanying code-base of Li et al. (2022)3 using Python 3.10. |
| Experiment Setup | Yes | For PPO, we use AdamW (Loshchilov & Hutter, 2019) with a learning rate of 2e-7, weight decay of 0.1, and a linear warm-up over 50 steps. We set the KL regularization factor β of the reward term to 0.05 (Section 2.2)... We set ϵ = 0.2. For optimizing the value function... we set the discount factor γ to 1 and the value clipping threshold α to 0.2. During training, we perform inference with a temperature of 1.0; we use neither nucleus (top-p) nor top-k sampling. We collect 1024 rollouts and perform 4 updates on 256 sequences each. |
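The experiment-setup row quotes the paper's PPO hyperparameters (β = 0.05, ϵ = 0.2, α = 0.2, γ = 1). A minimal sketch of how those values enter the standard PPO objectives is below; the KL penalty uses the common per-token estimate `logp_policy - logp_ref`, which is an assumption — the paper's exact estimator and loss code are not given, and all function names here are illustrative.

```python
import math

# Hyperparameters quoted from the paper's experiment setup.
BETA = 0.05      # KL regularization factor on the reward term
EPS_CLIP = 0.2   # PPO ratio clipping threshold (epsilon)
ALPHA = 0.2      # value clipping threshold (alpha)
GAMMA = 1.0      # discount factor (gamma); no discounting

def shaped_reward(task_reward, logp_policy, logp_ref):
    """Reward with a KL penalty toward the reference (initial) model.
    Uses the simple per-token KL estimate logp_policy - logp_ref;
    this estimator is an assumption, not taken from the paper."""
    return task_reward - BETA * (logp_policy - logp_ref)

def ppo_policy_loss(logp_new, logp_old, advantage):
    """Clipped PPO surrogate objective for a single token."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + EPS_CLIP), 1.0 - EPS_CLIP) * advantage
    # Minimize the negative of the pessimistic (min) surrogate.
    return -min(unclipped, clipped)

def clipped_value_loss(v_new, v_old, v_target):
    """Squared-error value loss with clipping threshold ALPHA,
    as in common PPO implementations."""
    v_clipped = v_old + max(min(v_new - v_old, ALPHA), -ALPHA)
    return max((v_new - v_target) ** 2, (v_clipped - v_target) ** 2)
```

With γ = 1, returns are undiscounted sums of the shaped rewards over a rollout; the batch schedule quoted above (1024 rollouts, 4 updates on 256 sequences each) corresponds to one epoch over the collected data per iteration.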