Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation

Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Xingru Jiang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges.
Researcher Affiliation: Academia — (1) Peking University, Beijing, China; (2) William & Mary, VA, USA. Correspondence to: Wei Ye <EMAIL>.
Pseudocode: Yes — As formalized in Algorithm 1 and illustrated in Figure 2, at step t, a node in a given search beam represents a state s_t = (R_t, C_t, F_t, ω_t, K_t, ρ_t), where R_t denotes the current reasoning chain, C_t the code implementation, F_t the execution feedback, ω_t the outcome reward score, K_t the self-critic reasoning, and ρ_t the process reward score.
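The state tuple above can be sketched as a simple data structure. This is a hypothetical illustration of the paper's notation, not the authors' implementation; the field names and the weighted scoring function are assumptions (the α = β = 0.5 defaults follow the reported hyperparameters):

```python
from dataclasses import dataclass

@dataclass
class State:
    """One node in a search beam at step t (notation from Algorithm 1)."""
    reasoning: str         # R_t: current reasoning chain
    code: str              # C_t: code implementation
    feedback: str          # F_t: execution feedback
    outcome_reward: float  # omega_t: outcome reward score
    critique: str          # K_t: self-critic reasoning
    process_reward: float  # rho_t: process reward score

def combined_score(s: State, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Hypothetical weighted mix of process and outcome rewards for ranking
    beam states, using the paper's reported default weights of 0.5 each."""
    return alpha * s.process_reward + beta * s.outcome_reward

s = State("reason about edge cases", "def f(): ...",
          "all tests passed", 0.9, "looks correct", 0.8)
print(round(combined_score(s), 2))  # → 0.85
```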
Open Source Code: Yes — We open-source at: https://github.com/zhuohaoyu/ORPS
Open Datasets: Yes — We evaluate on 3 programming benchmarks, as shown in Table 1. LBPP is a recent complex programming dataset manually curated by human experts with competitive-programming experience. HumanEval and MBPP are popular code generation benchmarks but can be trivial for current LLMs (Matton et al., 2024; Chen et al., 2021b; Austin et al., 2021).
Dataset Splits: No — The paper reports the number of test problems and unit tests for each benchmark (LBPP, HumanEval, MBPP) in Table 1, but it does not give explicit training/validation/test splits for the main experiments, since the framework is inference-only over pre-trained LLMs. For the PRM-training ablation, it refers to a 'larger GPT-4o labeled dataset' without specifying its splits.
Hardware Specification: Yes — All experiments were performed on NVIDIA A800 GPUs, each equipped with 80 GB of GPU memory.
Software Dependencies: No — The paper mentions the FreeEval (Yu et al., 2024b) codebase and Hugging Face's text-generation-inference toolkit for inference, and LLaMA-Factory (Zheng et al., 2024b) and DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) for training PRMs, but it does not provide version numbers for these dependencies.
Experiment Setup: Yes — The following hyperparameters were used for the search algorithm in ORPS:
  Search depth (num_rounds): 5
  Beam width (top_k): 3
  Expansion factor (num_samples): 20
  Process reward weight (α): 0.5
  Outcome reward weight (β): 0.5
All inference experiments were conducted on a single machine using the FreeEval (Yu et al., 2024b) codebase, integrated with Hugging Face's text-generation-inference toolkit for efficient model serving, with these inference settings:
  Maximum context length (max_tokens): 18,000 tokens
  Generated tokens per round: 1,500 tokens
To ensure consistent and reproducible results, the following execution constraints were enforced during inference:
  Timeout per test case: 5 seconds
  Memory limit: 512 MB
  Maximum test cases per problem: 15
PRM training configuration:
  Training framework: LLaMA-Factory (Zheng et al., 2024b)
  Optimization framework: DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) (Zero Redundancy Optimizer, Stage 3)
  Base model: qwen-2.5-coder-7b-instruct
  Batch size per device: 2
  Gradient accumulation steps: 4
  Learning rate: 2 × 10^-5, with cosine decay
  Number of training epochs: 2.0
  Maximum sequence length: 16,384 tokens
  Mixed precision training: enabled with bf16 (Brain Floating Point 16-bit format)
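Under the listed search hyperparameters (depth 5, beam width 3, expansion factor 20), the search loop can be sketched roughly as follows. This is a minimal, hypothetical skeleton: `expand` stands in for the paper's LLM-driven state expansion and reward computation, and here it just produces randomly scored placeholder candidates.

```python
import random

def expand(state, num_samples):
    # Placeholder for the LLM proposing num_samples candidate next states;
    # each candidate is a (state, score) pair with a stand-in random score.
    return [(f"{state}->{i}", random.random()) for i in range(num_samples)]

def beam_search(initial, num_rounds=5, top_k=3, num_samples=20):
    """Beam-search skeleton matching the reported settings:
    search depth 5, beam width 3, expansion factor 20."""
    beam = [(initial, 0.0)]
    for _ in range(num_rounds):
        candidates = []
        for state, _ in beam:
            candidates.extend(expand(state, num_samples))
        # Keep only the top_k highest-scoring candidates for the next round.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    return beam

random.seed(0)
best = beam_search("start")
print(len(best))  # → 3
```

In the actual method, the per-candidate score would be the weighted combination of process and outcome rewards (α = β = 0.5) rather than a random value.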