RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Authors: Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark on competitive programming tasks and achieve large performance gains with both small (8B parameters) and large (70B) models, outperforming previous work while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps. |
| Researcher Affiliation | Industry | Meta FAIR. Correspondence to: Jonas Gehring <EMAIL>, Gabriel Synnaeve <EMAIL>. |
| Pseudocode | No | The paper describes the reinforcement learning process and reward function mathematically in Section 2.2 and uses code examples in figures and appendices, but it does not contain a dedicated pseudocode block or algorithm section for the RLEF method itself. |
| Open Source Code | No | The text mentions using a third-party code-base for evaluation: "We evaluate candidate solutions with the accompanying code-base of Li et al. (2022)3 using Python 3.10. 3https://github.com/google-deepmind/code_contests". However, it contains no statement or link indicating that the RLEF training code itself is open-sourced. |
| Open Datasets | Yes | We perform experiments on the Code Contests benchmark (Li et al., 2022), which requires generating a code solution to a problem specified in natural language along with a textual description of public test cases... Our improvements from RLEF on Code Contests further generalize to HumanEval+ and MBPP+, two popular benchmarks for code synthesis |
| Dataset Splits | Yes | Code Contests consists of a training set and two evaluation sets, valid and test, with 117 and 165 problems, respectively; we use the former for model and hyperparameter selection. We optimize our models on the training set, from which we discard 669 of the 13,328 problems due to missing public or private test cases. |
| Hardware Specification | Yes | We train our models on NVidia H100 GPUs; a training run takes approx. 20 wall time hours. With the above parameters we use 288 (128 for training, 160 for inference) and 2304 (1024 for training, 1280 for inference) GPUs for 8B and 70B models, respectively. |
| Software Dependencies | Yes | We evaluate candidate solutions with the accompanying code-base of Li et al. (2022)3 using Python 3.10. |
| Experiment Setup | Yes | For PPO, we use AdamW (Loshchilov & Hutter, 2019) with a learning rate of 2e-7, weight decay of 0.1, and a linear warm-up over 50 steps. We set the KL regularization factor β of the reward term to 0.05 (Section 2.2)... We set ϵ = 0.2. For optimizing the value function... we set the discount factor γ to 1 and the value clipping threshold α to 0.2. During training, we perform inference with a temperature of 1.0; we use neither nucleus (top-p) nor top-k sampling. We collect 1024 rollouts and perform 4 updates on 256 sequences each. |
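The experiment-setup row quotes the paper's PPO hyperparameters (β = 0.05, ϵ = 0.2, α = 0.2, γ = 1). A minimal sketch of how those values enter the standard PPO objectives is below; the KL penalty uses the common per-token estimate `logp_policy - logp_ref`, which is an assumption — the paper's exact estimator and loss code are not given, and all function names here are illustrative.

```python
import math

# Hyperparameters quoted from the paper's experiment setup.
BETA = 0.05      # KL regularization factor on the reward term
EPS_CLIP = 0.2   # PPO ratio clipping threshold (epsilon)
ALPHA = 0.2      # value clipping threshold (alpha)
GAMMA = 1.0      # discount factor (gamma); no discounting

def shaped_reward(task_reward, logp_policy, logp_ref):
    """Reward with a KL penalty toward the reference (initial) model.
    Uses the simple per-token KL estimate logp_policy - logp_ref;
    this estimator is an assumption, not taken from the paper."""
    return task_reward - BETA * (logp_policy - logp_ref)

def ppo_policy_loss(logp_new, logp_old, advantage):
    """Clipped PPO surrogate objective for a single token."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + EPS_CLIP), 1.0 - EPS_CLIP) * advantage
    # Minimize the negative of the pessimistic (min) surrogate.
    return -min(unclipped, clipped)

def clipped_value_loss(v_new, v_old, v_target):
    """Squared-error value loss with clipping threshold ALPHA,
    as in common PPO implementations."""
    v_clipped = v_old + max(min(v_new - v_old, ALPHA), -ALPHA)
    return max((v_new - v_target) ** 2, (v_clipped - v_target) ** 2)
```

With γ = 1, returns are undiscounted sums of the shaped rewards over a rollout; the batch schedule quoted above (1024 rollouts, 4 updates on 256 sequences each) corresponds to one epoch over the collected data per iteration.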