Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LLMs Can Reason Faster Only If We Let Them
Authors: Bilgehan Sel, Lifu Huang, Naren Ramakrishnan, Ruoxi Jia, Ming Jin
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations indicate that AoT-O3 shortens solution length by up to 80% compared to baseline AoT while maintaining or surpassing prior performance. Our evaluation encompasses various planning domains, from simple sequential tasks to complex strategic problems requiring multiple levels of reasoning. Results show that AoT-O3 achieves an 80% reduction in solution length while improving accuracy, an advantage that becomes even more critical under tight token constraints. |
| Researcher Affiliation | Academia | 1Virginia Tech, Blacksburg, USA. 2University of California, Davis, USA. Correspondence to: Bilgehan Sel <EMAIL>. |
| Pseudocode | No | The paper describes the approach using textual explanations and mathematical formulas for the SFT objective and reward model, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code, nor does it include a link to a code repository. |
| Open Datasets | Yes | To investigate the relationship between demonstration solution length and model performance, we conducted experiments using the OpenAI GPT-4 model (Achiam et al., 2023) on the Game of 24 benchmark (Yao et al., 2024; Sel et al., 2024b). We utilize the classic 8-puzzle variant of the sliding tile puzzle... We implement the classic word ladder puzzle, where the objective is to transform one word into another by changing a single letter at a time, with each intermediate step forming a valid English word. Our dataset is constructed using the NLTK base words dataset... |
| Dataset Splits | Yes | For each benchmark, we constructed datasets following these specifications: Game of X: training set 64,000 examples, test set 100 examples... N-Puzzle: training set 64,000 examples, test set 100 examples... Word Ladder: training set 32,000 examples, test set 100 examples. |
| Hardware Specification | Yes | All models were trained on 8x NVIDIA H100 GPUs with 80GB memory. |
| Software Dependencies | No | We utilized the trl library from Hugging Face; however, we needed a custom RLOO trainer implementation with minor but crucial differences from the original library, which is designed for LLM reward models. For multi-GPU training, we utilized DeepSpeed (Rasley et al., 2020) with a suitable gradient accumulation step to maintain effective batch sizes while managing memory constraints for the various models we trained. |
| Experiment Setup | Yes | The supervised fine-tuning (SFT) phase consists of 500 training steps with a batch size of 128 and a learning rate of 1e-5. This is followed by the reinforcement learning phase using RLOO (REINFORCE leave-one-out), where we employ a smaller batch size of 32 due to VRAM constraints and a reduced learning rate of 1e-6. During the RL phase, we generate 4 samples per problem to estimate policy gradients while maintaining reasonable computational requirements. Additionally, we set β = 0.5 and κ = 0.2. Further details are given in Appendix B. Optimizer: AdamW; Base learning rate: 1e-5; Weight decay: 0.01; Gradient clipping: 0.1 (max norm); Batch size: 128 (effective, after gradient accumulation); Training steps: 500; Warm-up steps: 50; Learning rate scheduler: Cosine annealing... RLOO Implementation: Algorithm: REINFORCE leave-one-out; Base learning rate: 1e-6; Batch size: 32; Samples per problem: 4; KL penalty coefficient: 0.1; Value loss coefficient: 0.5... Reward Model Parameters: Step penalty factor (α): 0.02; Minimum reward cutoff (β): -0.5; Solution path bonus (κ): 0.2... Inference Parameters: Temperature: 0.0; Top-p (nucleus sampling): 0.0; Maximum new tokens: 1024; Repetition penalty: No |
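The RLOO setup quoted above (4 samples per problem, REINFORCE leave-one-out) implies a leave-one-out baseline for each sample. The sketch below is a minimal illustration of that advantage estimate only, not the authors' custom trainer:

```python
def rloo_advantages(rewards):
    """REINFORCE leave-one-out (RLOO) advantage estimates.

    For each of the k sampled completions of the same problem, the
    baseline is the mean reward of the other k-1 samples, so the
    advantage is r_i minus that leave-one-out mean.
    """
    k = len(rewards)
    assert k >= 2, "RLOO needs at least 2 samples per problem"
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

With the paper's setting of 4 samples per problem, a single correct sample among three failures yields a positive advantage for the success and equal negative advantages for the failures; the advantages always sum to zero.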
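The reward model parameters listed above (step penalty factor α, minimum reward cutoff β, solution path bonus κ) suggest a shaped reward of the following general form. How these parameters actually combine is specified only in the paper's Appendix B; the function below is a hypothetical sketch under assumed semantics:

```python
def shaped_reward(solved, n_steps, alpha=0.02, beta=-0.5, kappa=0.2):
    """Hypothetical shaped reward (assumed combination, not the paper's
    exact formula): 1.0 for a solved instance, minus a per-step penalty
    alpha, plus a solution-path bonus kappa on success, floored at the
    minimum reward cutoff beta."""
    reward = (1.0 if solved else 0.0) - alpha * n_steps
    if solved:
        reward += kappa
    return max(reward, beta)
```

Under this reading, the step penalty directly pressures the policy toward shorter solutions, consistent with the paper's reported reduction in solution length, while the cutoff β bounds how harshly long failed trajectories are punished.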
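The word ladder benchmark described in the Open Datasets row (change one letter per step, every intermediate word valid) admits a straightforward validity check. The `vocab` set in the usage example is a toy stand-in for the NLTK base words dataset the paper builds on:

```python
def valid_ladder(ladder, vocab):
    """Check a word-ladder solution: consecutive words must be the same
    length and differ in exactly one letter, and every word after the
    start must appear in the vocabulary."""
    for prev, word in zip(ladder, ladder[1:]):
        if len(prev) != len(word) or word not in vocab:
            return False
        if sum(a != b for a, b in zip(prev, word)) != 1:
            return False
    return True

# Example with a toy vocabulary (the paper uses NLTK base words):
# valid_ladder(["cold", "cord", "card", "ward", "warm"], vocab)
```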