AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
Authors: Yuliang Liu, Junjie Lu, Chaofeng Qu, Zhaoling Chen, Zefan Cai, Jason Klein Liu, Chonghan Liu, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation show that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing the greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. |
| Researcher Affiliation | Collaboration | ¹Nanjing University, ²Shanghai Innovation Institute, ³University of Technology Sydney, ⁴Independent, ⁵UW-Madison, ⁶MSRA, ⁷Shanghai Jiao Tong University. Correspondence to: Chuheng Zhang <EMAIL>, Wei Shen <EMAIL>, Zhouhan Lin <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using text and flowcharts (Figure 2), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide our code on https://github.com/Lux0926/ASPRM. |
| Open Datasets | Yes | For the mathematical reasoning task, we evaluate on the GSM8k (Cobbe et al., 2021) and MATH500 (Lightman et al., 2023) datasets. |
| Dataset Splits | Yes | To train ASPRM for code tasks, we collected 1,745 problems from the Leet Code problems as our training set and 175 problems as the test set. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions models like Mistral-V0.1, Meta Math-Llama-3.1-8B, and Deepseek-Coder-Base, and the spaCy library, but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Parameter Settings: We sample 30 times per data point and deduplicate the responses in Step 1. For labeling the PRM training data, we perform 8 rollouts per step using the same model π. This process generates 388k PRM training samples. |
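The labeling procedure quoted in the Experiment Setup row (8 rollouts per step with the same policy model π) is a Monte-Carlo estimate of each step's value: the fraction of completions from that step's prefix that reach a correct final answer. A minimal sketch of that estimator, where `rollout` and `is_correct` are hypothetical stand-ins for the policy's sampler and the answer checker:

```python
def label_step(prefix, rollout, is_correct, n_rollouts=8):
    """Monte-Carlo soft label for one reasoning step.

    prefix:     the solution text up to and including this step
    rollout:    callable that samples one completion from the prefix
                (stand-in for sampling from the policy model pi)
    is_correct: callable that checks a completion's final answer
    Returns the fraction of n_rollouts completions that are correct,
    used as the PRM training target for this step.
    """
    hits = sum(bool(is_correct(rollout(prefix))) for _ in range(n_rollouts))
    return hits / n_rollouts
```

Applied over every step of the ~30 deduplicated samples per data point, this is the kind of loop that would yield the 388k PRM training samples the row describes; the actual implementation is in the linked repository.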