rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Authors: Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across four SLMs (1.5B-7B) and seven math reasoning tasks demonstrate the effectiveness of rStar-Math. Remarkably, rStar-Math improves all four SLMs, matching or even surpassing OpenAI o1 on challenging math benchmarks. On the MATH benchmark, with 8 search trajectories, rStar-Math boosts Qwen2.5-Math-7B from 58.8% to 89.4% and Qwen2.5-Math-1.5B from 51.2% to 87.8%. With 64 trajectories, the scores rise to 90% and 88.4%, outperforming o1-preview by 4.5% and 2.6% and matching o1-mini's 90%. On the Olympiad-level AIME 2024, rStar-Math solves on average 53.3% (8/15) of the problems, exceeding o1-preview by 8.7% and all other open-sourced LLMs. We further conduct comprehensive experiments to verify the superiority of step-by-step verified reasoning trajectories over state-of-the-art data synthesis baselines, as well as the PPM's effectiveness compared to outcome reward models and Q-value-based PRMs.
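The 8- and 64-trajectory scores above come from sampling multiple reasoning trajectories and keeping the answer of the trajectory the process preference model (PPM) scores highest. A minimal sketch of that selection step, where `ppm_score` is a toy stand-in for the trained PPM (names and data are illustrative, not from the rStar-Math codebase):

```python
# Best-of-N trajectory selection sketch: score each sampled trajectory
# with a (toy) PPM and return the answer of the top-scoring one.

def ppm_score(trajectory):
    # Toy scorer: averages per-step scores attached to the trajectory.
    # In rStar-Math this would be the trained process preference model.
    steps = trajectory["step_scores"]
    return sum(steps) / len(steps)

def select_answer(trajectories):
    """Return the final answer of the highest-scoring trajectory."""
    best = max(trajectories, key=ppm_score)
    return best["answer"]

trajectories = [
    {"answer": "42", "step_scores": [0.9, 0.8, 0.95]},
    {"answer": "41", "step_scores": [0.7, 0.6, 0.5]},
]
print(select_answer(trajectories))  # -> 42
```

With 8 or 64 sampled trajectories, only the PPM-preferred answer is reported, which is how the pass rates above are obtained.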
Researcher Affiliation | Collaboration | ¹Microsoft Research Asia, ²Peking University, ³University of Science and Technology of China, ⁴Tsinghua University. Xinyu Guan, Yifei Liu, and Youran Sun did this work during internships at MSRA. Correspondence to: Li Lyna Zhang <EMAIL>.
Pseudocode | No | The paper describes the methodology using diagrams (Fig. 1) and prose in Section 3, detailing processes such as MCTS-driven exploration, code-augmented CoT generation, and self-evolution rounds. However, it does not include a distinct block labeled "Pseudocode" or "Algorithm" with structured steps.
Open Source Code | Yes | Code and data are available at https://github.com/microsoft/rStar.
Open Datasets | Yes | We collect a large dataset of 747k math word problems with ground-truth answers, primarily from NuminaMath (Jia LI & Polu, 2024a) and MetaMath (Yu et al., 2023b).
Dataset Splits | No | The paper collects 747k math word problems for training, categorizes problems by difficulty during MCTS, and selects correct trajectories for training. However, it does not explicitly provide training/validation/test splits (e.g., percentages or exact counts) of the 747k collected problems used to train its own policy SLMs and PPMs. Evaluation uses external benchmarks such as MATH and AIME, which are test sets rather than internal splits of the primary training data.
Hardware Specification | Yes | In the initial bootstrap round, we use DeepSeek-Coder-v2-Instruct (236B) as the policy model, running on 10 nodes of 8×80GB H100 GPUs with 8 MCTS rollouts. This required approximately two weeks to finish the data generation. For rounds 2-4, using our fine-tuned 7B SLM as the policy model, data generation was performed on 15 nodes of 4×40GB A100 GPUs, with each round completed in three days.
Software Dependencies | No | The paper mentions using Python for code execution and libraries such as `sympy`, `numpy`, and `itertools` in code examples (Figure 2, Appendix A.1, A.3). However, it does not specify version numbers for Python or any of these libraries, which are required for reproducible software details.
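The code-augmented CoT generation referenced throughout this review pairs each reasoning step with executable Python and keeps a step only if its code runs without error. A hedged sketch of that filter, using only the standard library (the paper's own examples use `sympy` and `numpy`; `step_executes` is an illustrative name, not from the rStar-Math codebase):

```python
# Sketch of code-augmented CoT step verification: a candidate step's
# Python snippet is executed, and the step is retained only if the
# snippet completes without raising an exception.
import math

def step_executes(code, env=None):
    """Run a step's code snippet; return True if it raises no error."""
    try:
        exec(code, {"math": math} if env is None else env)
        return True
    except Exception:
        return False

good = "x = math.sqrt(16)\nassert x == 4.0"
bad = "y = 1 / 0"
print(step_executes(good), step_executes(bad))  # -> True False
```

Steps whose embedded code fails (or whose assertions do not hold) are pruned, which is what makes the retained trajectories "step-by-step verified."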
Experiment Setup | Yes | All policy SLMs are trained for 2 epochs with a sequence length of 4096 tokens and a batch size of 128. We use the AdamW optimizer with a linear learning-rate scheduler and an initial learning rate of 7e-6 for Qwen models, and a cosine scheduler with an initial learning rate of 5e-6 for Phi3-mini-Instruct. The PPM is trained for 1 epoch with a batch size of 512 and an initial learning rate of 7e-6.
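The two schedules named in the setup can be sketched as plain functions of the training step. The base rates follow the paper (7e-6 linear for Qwen, 5e-6 cosine for Phi3-mini-Instruct), while the decay-to-zero endpoint and total step count are assumptions for illustration:

```python
# Sketch of the two reported learning-rate schedules, assuming both
# decay from the paper's initial rate to zero over training.
import math

def linear_lr(step, total_steps, base_lr=7e-6):
    """Linear decay from base_lr (Qwen policy models) to 0."""
    return base_lr * (1 - step / total_steps)

def cosine_lr(step, total_steps, base_lr=5e-6):
    """Cosine decay from base_lr (Phi3-mini-Instruct) to 0."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

# Rates at the start, midpoint, and end of a hypothetical 100-step run.
for step in (0, 50, 100):
    print(step, linear_lr(step, 100), cosine_lr(step, 100))
```

In practice these would be realized through a framework scheduler (e.g., a warmup variant); the paper does not state warmup steps, so none are modeled here.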