Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Authors: Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate MRT in two settings... on a dataset of math reasoning problems. We find that MRT consistently outperforms outcome-reward RL, achieving state-of-the-art results at the 1.5B parameter scale across multiple benchmarks in aggregate (AIME 2024/2025, AMC 2023, etc.)... Next, we perform controlled experiments to better understand the reasons behind the efficacy of MRT. |
| Researcher Affiliation | Collaboration | ¹CMU, ²Hugging Face. Correspondence to: Yuxiao Qu <EMAIL>. |
| Pseudocode | Yes | G.1. Pseudocode: Algorithm 1 MRT (STaR); Algorithm 2 MRT (RL) |
| Open Source Code | No | This work was built upon TRL (von Werra et al., 2020) and Open-R1 (Face, 2025). Our research would not be possible without these open-source projects. We would like to express special thanks to Quentin Gallouédec, Kashif Rasul, and the rest of the Hugging Face team for their invaluable guidance, technical insights, and continuous support with Open-R1 and TRL. This expertise significantly sped up the development of our methods. |
| Open Datasets | Yes | Empirically, we evaluate MRT in two settings... on a dataset of math reasoning problems. We find that MRT consistently outperforms outcome-reward RL, achieving state-of-the-art results at the 1.5B parameter scale across multiple benchmarks in aggregate (AIME 2024/2025, AMC 2023, etc.)... fine-tuned on 10K randomly sampled problem-solution pairs from NuminaMath (Li et al., 2024) and estimate the progress bonus for backtracking by rolling out each prefix 20 times. For the RL variant, we utilized DeepSeek-R1-Distill-Qwen-1.5B and DeepScaleR-1.5B-Preview as base models... finetuned only on 919 AIME problems from 1989-2023. |
| Dataset Splits | No | The paper mentions training data sizes and evaluation sets: "fine-tuned on 10K randomly sampled problem-solution pairs from NuminaMath", "finetuned DeepSeek-R1-Distill-Qwen-1.5B with MRT on 4,000 NuminaMath problems", and "finetuned only on 919 AIME problems from 1989-2023". It also refers to evaluation on "AIME 2024/2025" and "AMC 2023". However, it does not provide explicit reproducible dataset split ratios (e.g., 80/10/10) or specific sample counts for training, validation, and test sets for the datasets it used for training (such as NuminaMath or the 1989-2023 AIME problems). |
| Hardware Specification | No | The paper mentions 'num gpus 8' in tables detailing hyperparameters (Table 2, 3, 4, 5 in Appendices G.2 and G.3), but it does not specify the model or type of these GPUs (e.g., NVIDIA A100, Tesla V100), nor any other specific hardware components like CPU models or memory. |
| Software Dependencies | No | The paper states: 'This work was built upon TRL (von Werra et al., 2020) and Open-R1 (Face, 2025). Our research would not be possible without these open-source projects.' While it names TRL and Open-R1 as software used, it does not provide specific version numbers for these libraries or any other key software dependencies. |
| Experiment Setup | Yes | The paper provides extensive details on the experimental setup, particularly in Appendices G.2 and G.3, under 'Hyperparameters for Open-ended Parameterizations' and 'Hyperparameters for Backtracking Search'. These tables list specific values for: 'learning rate 1.0e-6', 'num train epochs 3', 'batch size 256', 'max seq length 16384', 'bf16 True', 'num gpus 8', 'lr scheduler type cosine', 'warmup ratio 0.1', 'weight decay 0.01', 'max prompt length 4096', 'max completion length 24576', 'num generations 4', 'use vllm True', 'vllm gpu memory utilization 0.8', 'temperature 0.9', 'deepspeed multinode launcher standard', 'zero3 init flag true', and 'zero stage 3'. |
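The hyperparameter names quoted above match the conventions of Hugging Face TRL and Accelerate/DeepSpeed configuration files, which the paper says its code builds on. A minimal sketch of how these reported values might be laid out in such a config is shown below; the grouping and exact key names are assumptions for illustration, not files released by the authors:

```yaml
# Sketch of a TRL/Accelerate-style config with the values reported in
# Appendices G.2/G.3. Key names follow common TRL/Accelerate conventions;
# the file layout itself is an assumption, not from the paper.

# Accelerate / DeepSpeed settings
deepspeed_multinode_launcher: standard
zero3_init_flag: true
zero_stage: 3
num_gpus: 8
bf16: true

# Training hyperparameters
learning_rate: 1.0e-6
num_train_epochs: 3
batch_size: 256            # reported batch size; per-device split not specified
max_seq_length: 16384
lr_scheduler_type: cosine
warmup_ratio: 0.1
weight_decay: 0.01

# Generation / RL settings
max_prompt_length: 4096
max_completion_length: 24576
num_generations: 4
use_vllm: true
vllm_gpu_memory_utilization: 0.8
temperature: 0.9
```

Note that without GPU model names or library versions (flagged as missing in the Hardware Specification and Software Dependencies rows), this config alone would not guarantee an exact reproduction.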