Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning

Authors: Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on math and coding benchmarks demonstrate that our approach consistently improves performance across different base models without requiring human-annotated thinking processes.
Researcher Affiliation | Collaboration | 1Center for Data Science, Peking University 2Northwestern University 3Bytedance Research 4University of Minnesota. Correspondence to: Han Zhong <EMAIL>.
Pseudocode | No | The paper describes the BRiTE algorithm in prose within Section 3.2, but it does not present a structured pseudocode block or a clearly labeled algorithm figure.
Open Source Code | No | The paper refers to using open-source instruction-tuned LLMs and mentions external repositories for datasets, but it does not explicitly state that the source code for the BRiTE methodology is released or provide a link to it.
Open Datasets | Yes | To evaluate mathematical reasoning capabilities, we conduct experiments on two prominent benchmarks: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). ... RUC-AIBOX/STILL-3-Preview-RL-Data2, MATH, and historical AIME problems (excluding AIME 2024). ... The first 4000 rows of the educational instruct split of the dataset OpenCoder-LLM/opc-sft-stage2 (Huang et al., 2024) as the training dataset...
Dataset Splits | No | The training sets of GSM8K and MATH datasets each contain approximately 7,500 data points. ... We choose the first 4000 rows of the educational instruct split of the dataset OpenCoder-LLM/opc-sft-stage2 (Huang et al., 2024) as the training dataset... The paper mentions training and test sets but does not specify the exact split percentages or sample counts for training, validation, and testing as applied in their experiments.
Hardware Specification | Yes | The BRiTE algorithm is run on 4 NVIDIA H100 during training. ... We use 4 NVIDIA A100 GPUs for all the training.
Software Dependencies | No | The paper mentions leveraging the PPO pipeline and adopting LoRA training but does not provide specific version numbers for these or other software libraries used in their implementation.
Experiment Setup | Yes | We leverage the PPO pipeline (Schulman et al., 2017) to learn the sampling policy Q (3.4) with a learning rate of 5e-7 and a batch size of 1. For the subsequent SFT on rationales sampled by Q, we set the learning rate to 5e-5 and the batch size to 2. We adopt the LoRA (Hu et al., 2021) training for both steps, where r is set to 32 and lora_alpha is set to 128.
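To make the quoted LoRA hyperparameters (r = 32, lora_alpha = 128) concrete, here is a minimal NumPy sketch of a LoRA adapter on a single linear layer. The layer dimensions, initialization scales, and function names are illustrative assumptions, not details from the paper; in practice the authors would use a LoRA library (e.g. PEFT) inside their PPO and SFT pipelines.

```python
import numpy as np

# Hedged sketch of one LoRA-adapted linear layer.
# r and lora_alpha match the paper's quoted setup; d_in/d_out are made up.
d_in, d_out = 512, 512
r, lora_alpha = 32, 128
scaling = lora_alpha / r  # LoRA rescales the low-rank update by alpha/r = 4

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialized

def lora_forward(x):
    # y = W x + (alpha / r) * B A x
    # With B = 0 at initialization, the adapter is a no-op and the
    # adapted layer reproduces the base model's output exactly.
    return W @ x + scaling * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B (roughly 2 * r * d parameters per layer) are updated during the PPO and SFT steps, which is what makes the two-stage training affordable on the 4-GPU setups quoted above.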