B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

Authors: Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model's exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-improving effectiveness based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model's exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.
Researcher Affiliation | Collaboration | 1. The Hong Kong University of Science and Technology; 2. BAAI; 3. Tencent
Pseudocode | Yes | Algorithm 1: B-STaR
Open Source Code | Yes | We open-source our code at https://github.com/hkust-nlp/B-STaR.
Open Datasets | Yes | Specifically, we follow Singh et al. (2023) to adopt the MATH (Hendrycks et al., 2021) training set as the training data, evaluating on the test split of MATH. We also conduct evaluation on the test split of GSM8K, with experimental results shown in Appendix A.2. On coding challenges, we follow Singh et al. (2023) and adopt the APPS (Hendrycks et al., 2021) dataset for both training and testing. For commonsense reasoning, following Pang et al. (2024), we conduct experiments on ARC-Challenge (Clark et al., 2018), a dataset consisting of multiple-choice science questions designed to evaluate commonsense reasoning beyond mathematics and coding challenges.
Dataset Splits | Yes | For the MATH dataset, we follow previous settings (Lightman et al., 2023; Wang et al., 2024b; Sun et al., 2024) by using a subset of 500 representative problems (MATH500) as our test data. We uniformly sample an additional 500 problems for validation and use the remaining 4,000 problems from the MATH test set along with the original 7,500 training problems as our training data.
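The split described above can be sketched as follows. This is a minimal illustration, assuming the MATH test pool holds 5,000 problems (so that 4,000 remain after carving out test and validation sets, giving 4,000 + 7,500 = 11,500 training queries, consistent with the M = 11,500 reported in the experiment setup). Note the paper uses the fixed MATH500 subset as test data, whereas this sketch draws the 500-problem test split at random; `build_math_splits` is a hypothetical helper, not the authors' code.

```python
import random


def build_math_splits(test_pool, train_pool, seed=0):
    """Illustrative partition of MATH data as described in the review:
    500 test problems (the paper uses the fixed MATH500 subset),
    500 validation problems, and the remaining test-pool problems
    merged with the original training problems."""
    rng = random.Random(seed)
    pool = list(test_pool)
    rng.shuffle(pool)
    test_split = pool[:500]          # stand-in for MATH500
    val_split = pool[500:1000]       # uniformly sampled validation set
    train_split = pool[1000:] + list(train_pool)
    return train_split, val_split, test_split


# 5,000-problem test pool + 7,500 training problems
# -> 11,500 training queries, 500 validation, 500 test.
train, val, test = build_math_splits(range(5000), range(7500))
```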
Hardware Specification | No | The paper uses large language models like Mistral-7B and Llama-3-8B as base models for experiments and mentions computational resource constraints, but it does not specify any particular hardware components such as GPU or CPU models, or memory details.
Software Dependencies | No | The paper mentions using Mistral-7B and Llama-3-8B models, but it does not list any specific software libraries, frameworks, or programming languages with their corresponding version numbers used for implementation.
Experiment Setup | Yes | For SFT, we use Mistral-7B (Jiang et al., 2023) as the base model with a learning rate of 5e-6, a batch size of 128, and train for 3 epochs. ... We then proceed with 9 iterations, where each iteration consists of 500 training steps with a batch size of 128. At the beginning of each iteration, we sample 32 candidate responses for each query, using a temperature of 1.0. ... We train the reward model using the Mistral-7B base, with a learning rate of 2e-6, for 2 epochs. ... We set the reward threshold to 0.0, selecting only those responses with final reward scores exceeding this threshold. ... We set the number of samples per iteration (N) to 67,500 and feed 11,500 MATH training queries (M) per iteration. We set the sample size to 64 for all methods. We vary temperature from 0.5 to 1.2 in 0.1 increments and reward threshold from -1.0 to 1.0 in 0.1 increments.
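The reward-threshold filtering and the per-iteration configuration grid described above can be sketched as follows. This is a minimal illustration under stated assumptions: `scored_responses` is taken to be a list of (response, reward) pairs, and `select_responses` is a hypothetical helper, not the authors' implementation of B-STaR's adaptive configuration search.

```python
import itertools


def select_responses(scored_responses, threshold=0.0):
    """Keep only candidate responses whose final reward score
    exceeds the threshold (0.0 in the paper's setup)."""
    return [(resp, r) for resp, r in scored_responses if r > threshold]


# Configuration grid searched at each iteration, per the review:
# temperature 0.5-1.2 and reward threshold -1.0-1.0, both in 0.1 steps.
temperatures = [round(0.5 + 0.1 * i, 1) for i in range(8)]    # 0.5 .. 1.2
thresholds = [round(-1.0 + 0.1 * i, 1) for i in range(21)]    # -1.0 .. 1.0
grid = list(itertools.product(temperatures, thresholds))
```

B-STaR evaluates these configurations against the current policy and reward model to pick the exploration/exploitation balance for the next iteration; the grid above only enumerates the search space.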