AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

Authors: Yuliang Liu, Junjie Lu, Chaofeng Qu, Zhaoling Chen, Zefan Cai, Jason Klein Liu, Chonghan Liu, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments with AdaptiveStep-trained PRMs on mathematical reasoning and code generation show that the resulting PRM achieves state-of-the-art Best-of-N performance, surpasses the greedy search strategy with token-level value-guided decoding, and reduces construction costs by over 30% compared to existing open-source PRMs. |
| Researcher Affiliation | Collaboration | ¹Nanjing University, ²Shanghai Innovation Institute, ³University of Technology Sydney, ⁴Independent, ⁵UW-Madison, ⁶MSRA, ⁷Shanghai Jiao Tong University. Correspondence to: Chuheng Zhang <EMAIL>, Wei Shen <EMAIL>, Zhouhan Lin <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in text and flowcharts (Figure 2) but contains no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide our code on https://github.com/Lux0926/ASPRM. |
| Open Datasets | Yes | For the mathematical reasoning task, we evaluate on the GSM8k (Cobbe et al., 2021) and MATH500 (Lightman et al., 2023) datasets. |
| Dataset Splits | Yes | To train ASPRM for code tasks, we collected 1,745 LeetCode problems as our training set and 175 problems as the test set. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions models such as Mistral-V0.1, MetaMath-Llama-3.1-8B, and Deepseek-Coder-Base, and the spaCy library, but gives no version numbers for general software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Parameter Settings: We sample 30 times per data point and deduplicate the responses in Step 1. For labeling the PRM training data, we perform 8 rollouts per step using the same model π. This process generates 388k PRM training samples. |
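The core AdaptiveStep idea named in the title (dividing a reasoning trace into steps wherever decoding confidence drops) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of top-1 token probability as the confidence signal, and the threshold value are all assumptions for the sketch.

```python
def split_by_confidence(tokens, confidences, threshold=0.85):
    """Group a generated token sequence into reasoning steps.

    A step boundary is placed after any token whose decoding
    confidence falls below `threshold`. Here "confidence" is
    assumed to be the top-1 token probability at each position;
    the threshold of 0.85 is illustrative, not from the paper.
    """
    steps, current = [], []
    for token, conf in zip(tokens, confidences):
        current.append(token)
        if conf < threshold:   # low confidence -> end the step here
            steps.append(current)
            current = []
    if current:                # flush any trailing partial step
        steps.append(current)
    return steps
```

For example, with confidences `[0.99, 0.40, 0.97, 0.95]` the sequence is split after the second token, yielding two steps; each resulting step would then be labeled (e.g., via the 8 rollouts per step described above) to build PRM training data.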