Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Authors: Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate MRT in two settings... on a dataset of math reasoning problems. We find that MRT consistently outperforms outcome-reward RL, achieving state-of-the-art results at the 1.5B parameter scale across multiple benchmarks in aggregate (AIME 2024/2025, AMC 2023, etc.)... Next, we perform controlled experiments to better understand the reasons behind the efficacy of MRT. |
| Researcher Affiliation | Collaboration | ¹CMU, ²Hugging Face. Correspondence to: Yuxiao Qu <EMAIL>. |
| Pseudocode | Yes | G.1. Pseudocode: Algorithm 1 MRT (STaR); Algorithm 2 MRT (RL) |
| Open Source Code | No | This work was built upon TRL (von Werra et al., 2020) and Open-R1 (Face, 2025). Our research would not be possible without these open-source projects. We would like to express special thanks to Quentin Gallouédec, Kashif Rasul, and the rest of the Hugging Face team for their invaluable guidance, technical insights, and continuous support with Open-R1 and TRL. This expertise significantly sped up the development of our methods. |
| Open Datasets | Yes | Empirically, we evaluate MRT in two settings... on a dataset of math reasoning problems. We find that MRT consistently outperforms outcome-reward RL, achieving state-of-the-art results at the 1.5B parameter scale across multiple benchmarks in aggregate (AIME 2024/2025, AMC 2023, etc.)... fine-tuned on 10K randomly sampled problem-solution pairs from NuminaMath (Li et al., 2024) and estimate the progress bonus for backtracking by rolling out each prefix 20 times. For the RL variant, we utilized DeepSeek-R1-Distill-Qwen-1.5B and DeepScaleR-1.5B-Preview as base models... finetuned only on 919 AIME problems from 1989-2023. |
| Dataset Splits | No | The paper mentions training data sizes and evaluation sets: "fine-tuned on 10K randomly sampled problem-solution pairs from NuminaMath", "finetuned DeepSeek-R1-Distill-Qwen-1.5B with MRT on 4,000 NuminaMath problems", and "finetuned only on 919 AIME problems from 1989-2023". It also refers to evaluation on "AIME 2024/2025" and "AMC 2023". However, it does not provide explicit reproducible dataset split ratios (e.g., 80/10/10) or specific sample counts for training, validation, and test sets for the datasets it used for training (such as NuminaMath or the 1989-2023 AIME problems). |
| Hardware Specification | No | The paper mentions 'num gpus 8' in tables detailing hyperparameters (Table 2, 3, 4, 5 in Appendices G.2 and G.3), but it does not specify the model or type of these GPUs (e.g., NVIDIA A100, Tesla V100), nor any other specific hardware components like CPU models or memory. |
| Software Dependencies | No | The paper states: 'This work was built upon TRL (von Werra et al., 2020) and Open-R1 (Face, 2025). Our research would not be possible without these open-source projects.' While it names TRL and Open-R1 as software used, it does not provide specific version numbers for these libraries or any other key software dependencies. |
| Experiment Setup | Yes | The paper provides extensive details on the experimental setup, particularly in Appendices G.2 and G.3, under 'Hyperparameters for Open-ended Parameterizations' and 'Hyperparameters for Backtracking Search'. These tables list specific values for: 'learning rate 1.0e-6', 'num train epochs 3', 'batch size 256', 'max seq length 16384', 'bf16 True', 'num gpus 8', 'lr scheduler type cosine', 'warmup ratio 0.1', 'weight decay 0.01', 'max prompt length 4096', 'max completion length 24576', 'num generations 4', 'use vllm True', 'vllm gpu memory utilization 0.8', 'temperature 0.9', 'deepspeed multinode launcher standard', 'zero3 init flag true', and 'zero stage 3'. |
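The hyperparameter names quoted above match the conventions of Hugging Face TRL and Accelerate/DeepSpeed configuration files, which the paper says its code builds on. A minimal sketch of how these reported values might be laid out in such a config is shown below; the grouping and exact key names are assumptions for illustration, not files released by the authors:

```yaml
# Sketch of a TRL/Accelerate-style config with the values reported in
# Appendices G.2/G.3. Key names follow common TRL/Accelerate conventions;
# the file layout itself is an assumption, not from the paper.

# Accelerate / DeepSpeed settings
deepspeed_multinode_launcher: standard
zero3_init_flag: true
zero_stage: 3
num_gpus: 8
bf16: true

# Training hyperparameters
learning_rate: 1.0e-6
num_train_epochs: 3
batch_size: 256            # reported batch size; per-device split not specified
max_seq_length: 16384
lr_scheduler_type: cosine
warmup_ratio: 0.1
weight_decay: 0.01

# Generation / RL settings
max_prompt_length: 4096
max_completion_length: 24576
num_generations: 4
use_vllm: true
vllm_gpu_memory_utilization: 0.8
temperature: 0.9
```

Note that without GPU model names or library versions (flagged as missing in the Hardware Specification and Software Dependencies rows), this config alone would not guarantee an exact reproduction.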