MARGE: Improving Math Reasoning with Guided Exploration
Authors: Jingyue Gao, Runji Lin, Keming Lu, Bowen Yu, Junyang Lin, Jianyu Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across multiple backbone models and benchmarks, we demonstrate that MARGE significantly improves reasoning capabilities without requiring external annotations or training additional value models. |
| Researcher Affiliation | Collaboration | 1Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China 2Alibaba Group 3Shanghai Qi Zhi Institute, Shanghai, China. Correspondence to: Runji Lin <EMAIL>, Jianyu Chen <EMAIL>. |
| Pseudocode | Yes | A. Algorithm — Algorithm 1 MARGE. Input: policy language model πθ; training query set DP; number of episodes M; query batch size B; KL loss coefficient β; Monte Carlo simulation number n; initial responses generation number n1; output reward function r. 1: D ← generate_policy(DP, πθ, n1) (generate n1 hit candidates for all queries); 2: for j = 1…M do; 3: select_guidance_solution(D) (Sec. 3.3: select a guiding solution for each question); 4: for query batch Di from D of size B do; 5: S ← get_states(Di) (get states of the guidance solutions in Di); 6: A ← generate_policy(S, πθ, n) (Sec. 3.3: generate n completions for all states in S with policy πθ); 7: V ← estimate_state_values(S, A) (Sec. 3.2: estimate state values); 8: πθ′ ← train(πθ, D, V) (Sec. 3.4: train policy with objective Eq. 3 for DPO or Eq. 4 for RL); 9: D ← update_hits(A) (Sec. 3.3: update guidance candidates with latest responses); 10: πθ ← πθ′ (use the updated policy); 11: end for; 12: end for; Output: trained policy πθ |
| Open Source Code | No | No explicit statement or link for the code of the described methodology (MARGE) is provided. The paper mentions utilizing open-source models/frameworks for baselines and implementation details for their method but does not provide a direct link or statement about releasing their own code for MARGE. |
| Open Datasets | Yes | For training, we start with the same subsets of MetaMathQA (Yu et al., 2024) and AQuA (Ling et al., 2017) as in Step-DPO. Considering Qwen2.5-7B-Instruct's and Qwen2.5-Math-7B-Instruct's already high performance in these tasks, we respectively randomly sample a subset of Omni-Math (Gao et al., 2024)'s and Big-Math (Albalak et al., 2025)'s training sets. For evaluation, we test our method on two widely adopted benchmarks: MATH (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021), which include questions from grade-school level to challenging competition problems. We also incorporate two more challenging datasets, OlympiadBench (He et al., 2024) and College Math (Tang et al., 2024), to further test our model's generalizability on out-of-distribution challenging problems. |
| Dataset Splits | No | The paper mentions using subsets of Meta Math QA and AQuA for training, and MATH, GSM8k, Olympiad Bench, College Math, and MATH500 for evaluation. While these are standard benchmarks, the paper does not explicitly state the training/test/validation splits (e.g., percentages or exact sample counts) for the datasets used to train their models, beyond mentioning that they use 'the same subsets ... as in Step DPO'. |
| Hardware Specification | Yes | Our experiments are done on 8 A100-80GB GPUs. |
| Software Dependencies | No | The paper mentions using TRL (von Werra et al., 2020), DeepSpeed (Rasley et al., 2020), and vLLM (Kwon et al., 2023) for implementation, and OpenRLHF (Hu et al., 2024) for baselines. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | When collecting rollouts with hit-guided exploration, we set the following sampling parameters: temperature as 0.8, top-p as 0.95, top-k as 1. For each state, 8 responses are collected. During RL training, we set the learning rate as 1 × 10⁻⁶ and batch size as 1024 to stabilize training. We set the coefficient for KL divergence as 0.01 and train the model with a context length of 2048. We train on the collected dataset for 2 epochs within each iteration. During DPO training, we set β to 0.4. We set the learning rate as 5 × 10⁻⁷ with a batch size of 256. Within each iteration, we train on the collected dataset for 4 epochs, with a maximum length of 2048. Some key parameters for RL baselines PPO and REINFORCE++ are also listed in Tab. 6. |
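The control flow of the quoted Algorithm 1 can be sketched as the loop below. Every helper here is a stub standing in for the paper's components (guidance selection, Monte Carlo value estimation, the Eq. 3/Eq. 4 training objectives); all function names, signatures, and return values are illustrative assumptions, not the authors' implementation.

```python
# Structural sketch of Algorithm 1 (MARGE). Stubs only; names are illustrative.

def generate_policy(queries, policy, n):
    """Stub: sample n candidate responses per query/state with the policy."""
    return {q: [f"{q}-resp{i}" for i in range(n)] for q in queries}

def select_guidance_solution(candidates):
    """Stub: pick one guiding (hit) solution per question (Sec. 3.3)."""
    return {q: resps[0] for q, resps in candidates.items()}

def get_states(guidance):
    """Stub: expose intermediate states of each guiding solution."""
    return list(guidance.values())

def estimate_state_values(states, completions):
    """Stub: Monte Carlo state-value estimates (Sec. 3.2)."""
    return {s: 0.0 for s in states}

def train(policy, batch, values):
    """Stub: one policy update via the DPO (Eq. 3) or RL (Eq. 4) objective."""
    return policy + 1  # counts updates in place of real training

def marge(policy, queries, M=2, B=2, n=8, n1=4):
    # Line 1: generate n1 initial hit candidates for all queries.
    candidates = generate_policy(queries, policy, n1)
    for _ in range(M):                                   # episodes (line 2)
        guidance = select_guidance_solution(candidates)  # line 3
        for i in range(0, len(queries), B):              # query batches (line 4)
            batch = queries[i:i + B]
            states = get_states({q: guidance[q] for q in batch})  # line 5
            completions = generate_policy(states, policy, n)      # line 6
            values = estimate_state_values(states, completions)   # line 7
            policy = train(policy, batch, values)                 # lines 8, 10
            candidates.update(completions)               # line 9: refresh hits
    return policy
```

With the stub `train` incrementing a counter, `marge(0, ["q1", "q2", "q3", "q4"])` performs M × (|queries| / B) = 4 policy updates, mirroring the nested episode/batch loops of the pseudocode.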
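For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into plain dictionaries as below. The grouping and key names are illustrative, not taken from any released configuration file.

```python
# Hyperparameters from the Experiment Setup excerpt, grouped for reference.
# Key names are assumptions; values are as reported in the paper.

rollout = {
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 1,
    "responses_per_state": 8,
}

rl_training = {
    "learning_rate": 1e-6,
    "batch_size": 1024,
    "kl_coef": 0.01,
    "context_length": 2048,
    "epochs_per_iteration": 2,
}

dpo_training = {
    "beta": 0.4,
    "learning_rate": 5e-7,
    "batch_size": 256,
    "epochs_per_iteration": 4,
    "max_length": 2048,
}
```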