Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
Authors: Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, Remi Munos
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out extensive ablations that showcase the trade-off of different objectives, such as an improved inference time performance when the training algorithm is aware of the inference time algorithm (Section 5): we show that when training on mathematical reasoning datasets such as MATH, as well as challenging code generation datasets such as Code Contests, new algorithmic variants achieve significant gains on inference time objectives of interest. |
| Researcher Affiliation | Collaboration | 1Meta Gen AI 2Meta FAIR. Correspondence to: Yunhao Tang <EMAIL>, Kunhao Zheng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Online policy optimization |
| Open Source Code | No | The paper does not explicitly state that source code will be released, nor does it link to a code repository for the described methodology. |
| Open Datasets | Yes | Throughout, we focus on the mathematical reasoning dataset MATH (Hendrycks et al., 2021)... We conduct our experiments on Code Contests (Li et al., 2022)... We examine HARP dataset (Yue et al., 2024)... report the performance on another competitive programming benchmark, TACO (Li et al., 2023)... |
| Dataset Splits | Yes | We train on the MATH training set with 7500 examples and evaluate on the test set with 5000 examples (Hendrycks et al., 2021). The original Code Contests training set contains 13328 problems... This results in total 12275 problems which we use to train our model. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only mentions the language models used, such as Llama 3. |
| Software Dependencies | No | The paper mentions software like Sympy and Python but does not provide specific version numbers for these or any other key software dependencies. |
| Experiment Setup | Yes | All experiments are conducted with identical hyper-parameter settings: we always apply a batch size of B = 64 prompts per update, and sample k = 4 distinct generations per prompt by default. All training and evaluation sampling are conducted at a temperature of τ = 1 and with top-p = 1. We use a learning rate of 2e-7, constant learning rate scheduling with 50 warmup steps, and weight decay of 0.1. We sample k = 8 generations per prompt. We update the model with a mini-batch size of 2 with sequence length 8192 and train in total 8k gradient update steps. |
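The reported hyper-parameters can be collected into a minimal configuration sketch. This is an illustrative reconstruction from the quoted setup, not the authors' code (which is not released); the dictionary keys and the helper function name are hypothetical.

```python
# Hypothetical config reconstructed from the reported hyper-parameters.
# Key names are illustrative; the paper does not specify a config format.
TRAIN_CONFIG = {
    "batch_size_prompts": 64,      # B = 64 prompts per update
    "generations_per_prompt": 4,   # k = 4 by default (k = 8 in some runs)
    "temperature": 1.0,            # sampling temperature tau = 1
    "top_p": 1.0,
    "learning_rate": 2e-7,
    "lr_schedule": "constant",
    "warmup_steps": 50,
    "weight_decay": 0.1,
    "mini_batch_size": 2,
    "max_sequence_length": 8192,
    "total_gradient_steps": 8000,
}

def generations_per_update(cfg):
    """Total sampled generations consumed per policy update
    (prompts per batch times generations per prompt)."""
    return cfg["batch_size_prompts"] * cfg["generations_per_prompt"]

print(generations_per_update(TRAIN_CONFIG))  # 64 * 4 = 256
```

With the default k = 4, each update consumes 256 sampled generations; under the k = 8 setting mentioned for some experiments, that doubles to 512.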