Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
Authors: Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, Remi Munos
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out extensive ablations that showcase the trade-off of different objectives, such as an improved inference time performance when the training algorithm is aware of the inference time algorithm (Section 5): we show that when training on mathematical reasoning datasets such as MATH, as well as challenging code generation datasets such as Code Contests, new algorithmic variants achieve significant gains on inference time objectives of interest. |
| Researcher Affiliation | Collaboration | 1Meta Gen AI 2Meta FAIR. Correspondence to: Yunhao Tang <EMAIL>, Kunhao Zheng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Online policy optimization |
| Open Source Code | No | The paper does not explicitly state that source code will be released, nor does it link to a code repository for the described methodology. |
| Open Datasets | Yes | Throughout, we focus on the mathematical reasoning dataset MATH (Hendrycks et al., 2021)... We conduct our experiments on Code Contests (Li et al., 2022)... We examine HARP dataset (Yue et al., 2024)... report the performance on another competitive programming benchmark, TACO (Li et al., 2023)... |
| Dataset Splits | Yes | We train on the MATH training set with 7500 examples and evaluate on the test set with 5000 examples (Hendrycks et al., 2021). The original Code Contests training set contains 13328 problems... This results in total 12275 problems which we use to train our model. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only mentions the language models used, such as Llama 3. |
| Software Dependencies | No | The paper mentions software like Sympy and Python but does not provide specific version numbers for these or any other key software dependencies. |
| Experiment Setup | Yes | All experiments are conducted with identical hyper-parameter settings: we always apply a batch size of B = 64 prompts per update, and sample k = 4 distinct generations per prompt by default. All training and evaluation sampling are conducted at a temperature of τ = 1 and with top-p = 1. We use a learning rate of 2e-7, constant learning rate scheduling with 50 warmup steps, and weight decay of 0.1. We sample k = 8 generations per prompt. We update the model with a mini-batch size of 2 with sequence length 8192 and train in total 8k gradient update steps. |
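The reported hyper-parameters can be collected into a minimal configuration sketch. This is an illustrative reconstruction from the quoted setup, not the authors' code (which is not released); the dictionary keys and the helper function name are hypothetical.

```python
# Hypothetical config reconstructed from the reported hyper-parameters.
# Key names are illustrative; the paper does not specify a config format.
TRAIN_CONFIG = {
    "batch_size_prompts": 64,      # B = 64 prompts per update
    "generations_per_prompt": 4,   # k = 4 by default (k = 8 in some runs)
    "temperature": 1.0,            # sampling temperature tau = 1
    "top_p": 1.0,
    "learning_rate": 2e-7,
    "lr_schedule": "constant",
    "warmup_steps": 50,
    "weight_decay": 0.1,
    "mini_batch_size": 2,
    "max_sequence_length": 8192,
    "total_gradient_steps": 8000,
}

def generations_per_update(cfg):
    """Total sampled generations consumed per policy update
    (prompts per batch times generations per prompt)."""
    return cfg["batch_size_prompts"] * cfg["generations_per_prompt"]

print(generations_per_update(TRAIN_CONFIG))  # 64 * 4 = 256
```

With the default k = 4, each update consumes 256 sampled generations; under the k = 8 setting mentioned for some experiments, that doubles to 512.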