B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

Authors: Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model's exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-improving effectiveness based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model's exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.
Researcher Affiliation | Collaboration | 1. The Hong Kong University of Science and Technology; 2. BAAI; 3. Tencent
Pseudocode | Yes | Algorithm 1: B-STaR
Open Source Code | Yes | We open-source our code at https://github.com/hkust-nlp/B-STaR.
Open Datasets | Yes | Specifically, we follow Singh et al. (2023) to adopt the MATH (Hendrycks et al., 2021) training set as the training data, evaluating on the test split of MATH. We also conduct evaluation on the test split of GSM8K, with experimental results shown in Appendix A.2. On coding challenges, we follow Singh et al. (2023) and adopt the APPS (Hendrycks et al., 2021) dataset for both training and testing. For commonsense reasoning, following Pang et al. (2024), we conduct experiments on ARC-Challenge (Clark et al., 2018), a dataset consisting of multiple-choice science questions designed to evaluate commonsense reasoning beyond mathematics and coding challenges.
Dataset Splits | Yes | For the MATH dataset, we follow previous settings (Lightman et al., 2023; Wang et al., 2024b; Sun et al., 2024) by using a subset of 500 representative problems (MATH500) as our test data. We uniformly sample an additional 500 problems for validation and use the remaining 4,000 problems from the MATH test set along with the original 7,500 training problems as our training data.
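The split described above can be sketched as follows. This is a minimal illustration, assuming the MATH test pool holds 5,000 problems (so that 4,000 remain after carving out test and validation sets, giving 4,000 + 7,500 = 11,500 training queries, consistent with the M = 11,500 reported in the experiment setup). Note the paper uses the fixed MATH500 subset as test data, whereas this sketch draws the 500-problem test split at random; `build_math_splits` is a hypothetical helper, not the authors' code.

```python
import random


def build_math_splits(test_pool, train_pool, seed=0):
    """Illustrative partition of MATH data as described in the review:
    500 test problems (the paper uses the fixed MATH500 subset),
    500 validation problems, and the remaining test-pool problems
    merged with the original training problems."""
    rng = random.Random(seed)
    pool = list(test_pool)
    rng.shuffle(pool)
    test_split = pool[:500]          # stand-in for MATH500
    val_split = pool[500:1000]       # uniformly sampled validation set
    train_split = pool[1000:] + list(train_pool)
    return train_split, val_split, test_split


# 5,000-problem test pool + 7,500 training problems
# -> 11,500 training queries, 500 validation, 500 test.
train, val, test = build_math_splits(range(5000), range(7500))
```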
Hardware Specification | No | The paper uses large language models like Mistral-7B and Llama-3-8B as base models for experiments and mentions computational resource constraints, but it does not specify any particular hardware components such as GPU or CPU models, or memory details.
Software Dependencies | No | The paper mentions using Mistral-7B and Llama-3-8B models, but it does not list any specific software libraries, frameworks, or programming languages with their corresponding version numbers used for implementation.
Experiment Setup | Yes | For SFT, we use Mistral-7B (Jiang et al., 2023) as the base model with a learning rate of 5e-6, a batch size of 128, and train for 3 epochs. ... We then proceed with 9 iterations, where each iteration consists of 500 training steps with a batch size of 128. At the beginning of each iteration, we sample 32 candidate responses for each query, using a temperature of 1.0. ... We train the reward model using the Mistral-7B base, with a learning rate of 2e-6, for 2 epochs. ... We set the reward threshold to 0.0, selecting only those responses with final reward scores exceeding this threshold. ... We set the number of samples per iteration (N) to 67,500 and feed 11,500 MATH training queries (M) per iteration. We set the sample size to 64 for all methods. We vary temperature from 0.5 to 1.2 in 0.1 increments and reward threshold from -1.0 to 1.0 in 0.1 increments.
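The reward-threshold filtering and the per-iteration configuration grid described above can be sketched as follows. This is a minimal illustration under stated assumptions: `scored_responses` is taken to be a list of (response, reward) pairs, and `select_responses` is a hypothetical helper, not the authors' implementation of B-STaR's adaptive configuration search.

```python
import itertools


def select_responses(scored_responses, threshold=0.0):
    """Keep only candidate responses whose final reward score
    exceeds the threshold (0.0 in the paper's setup)."""
    return [(resp, r) for resp, r in scored_responses if r > threshold]


# Configuration grid searched at each iteration, per the review:
# temperature 0.5-1.2 and reward threshold -1.0-1.0, both in 0.1 steps.
temperatures = [round(0.5 + 0.1 * i, 1) for i in range(8)]    # 0.5 .. 1.2
thresholds = [round(-1.0 + 0.1 * i, 1) for i in range(21)]    # -1.0 .. 1.0
grid = list(itertools.product(temperatures, thresholds))
```

B-STaR evaluates these configurations against the current policy and reward model to pick the exploration/exploitation balance for the next iteration; the grid above only enumerates the search space.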