Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Authors: Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across five SLMs demonstrate rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral, and from 74.53% to 91.13% for LLaMA3-8B-Instruct. |
| Researcher Affiliation | Collaboration | Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang; Microsoft Research Asia; Harvard University |
| Pseudocode | No | The paper describes the MCTS algorithm and its components (selection, expansion, simulation, back-propagation) and gives a mathematical formulation of UCT, but it does not include a distinct block explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code is available at https://github.com/zhentingqi/rStar. |
| Open Datasets | Yes | We test across 5 reasoning tasks, including 4 mathematical tasks (GSM8K (Cobbe et al., 2021), GSM-Hard (Gao et al., 2022), MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021)) and one commonsense reasoning task (StrategyQA (Geva et al., 2021)). |
| Dataset Splits | Yes | We test across 5 reasoning tasks, including 4 mathematical tasks (GSM8K (Cobbe et al., 2021), GSM-Hard (Gao et al., 2022), MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021)) and one commonsense reasoning task (StrategyQA (Geva et al., 2021)). |
| Hardware Specification | Yes | Currently, completing the 32 rollouts for the entire GSM8K test set takes about 4.5 days on a single A100 GPU per model. |
| Software Dependencies | No | The paper mentions implementing the methods but does not specify any software names with version numbers, such as specific Python libraries or frameworks. |
| Experiment Setup | Yes | In the trajectory self-generation stage, we augment each target SLM with our MCTS, performing 32 rollouts. Except for MATH, where we set the depth d to 8, all other tasks use d = 5. Actions A1 and A3 have a maximum of 5 nodes per depth, while the other actions have a default node count of 1. |
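As the Pseudocode row notes, the paper gives the UCT formula but no labeled algorithm block. Below is a minimal sketch of the standard UCT child-selection rule such an MCTS loop would use; the function names, the exploration constant, and the `(q_sum, visits)` representation are illustrative assumptions, not from the paper.

```python
import math

def uct_score(q_sum: float, n_child: int, n_parent: int, c: float = 1.414) -> float:
    """Standard UCT score: mean reward plus an exploration bonus.

    q_sum    -- total reward accumulated at this child node
    n_child  -- visit count of the child node
    n_parent -- visit count of the parent node
    c        -- exploration constant (value chosen for illustration)
    """
    if n_child == 0:
        return float("inf")  # unvisited children are expanded first
    return q_sum / n_child + c * math.sqrt(math.log(n_parent) / n_child)

def select_child(children: list[tuple[float, int]]) -> int:
    """Return the index of the child with the highest UCT score.

    `children` is a list of (q_sum, visit_count) pairs; the parent's
    visit count is taken as the sum of its children's visits.
    """
    n_parent = sum(n for _, n in children) or 1
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], n_parent),
    )
```

In a rollout loop such as the one described in the Experiment Setup row (32 rollouts, depth up to 5 or 8), `select_child` would be called at each tree level during the selection phase before expansion and back-propagation.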