Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LLMs Can Reason Faster Only If We Let Them

Authors: Bilgehan Sel, Lifu Huang, Naren Ramakrishnan, Ruoxi Jia, Ming Jin

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations indicate that AoT-O3 shortens solution length by up to 80% compared to baseline AoT while maintaining or surpassing prior performance. Our evaluation encompasses various planning domains, from simple sequential tasks to complex strategic problems requiring multiple levels of reasoning. Results show that AoT-O3 achieves an 80% reduction in solution length while improving accuracy, an advantage that becomes even more critical under tight token constraints.
Researcher Affiliation | Academia | 1Virginia Tech, Blacksburg, USA. 2University of California, Davis, USA. Correspondence to: Bilgehan Sel <EMAIL>.
Pseudocode | No | The paper describes the approach using textual explanations and mathematical formulas for the SFT objective and reward model, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code | No | The paper does not provide an explicit statement about releasing code, nor does it include a link to a code repository.
Open Datasets | Yes | To investigate the relationship between demonstration solution length and model performance, we conducted experiments using the OpenAI GPT-4 model (Achiam et al., 2023) on the Game of 24 benchmark (Yao et al., 2024; Sel et al., 2024b). We utilize the classic 8-puzzle variant of the sliding tile puzzle... We implement the classic word ladder puzzle, where the objective is to transform one word into another by changing a single letter at a time, with each intermediate step forming a valid English word. Our dataset is constructed using the NLTK base words dataset...
Dataset Splits | Yes | For each benchmark, we constructed datasets following these specifications: Game of X: training set 64,000 examples, test set 100 examples... N-Puzzle: training set 64,000 examples, test set 100 examples... Word Ladder: training set 32,000 examples, test set 100 examples.
Hardware Specification | Yes | All models were trained on 8x NVIDIA H100 GPUs with 80GB memory.
Software Dependencies | No | We utilized the trl library from Hugging Face; however, we needed a custom RLOO trainer implementation with minor but crucial differences from the original library, which is designed for LLM reward models. For multi-GPU training, we utilized DeepSpeed (Rasley et al., 2020) with a suitable gradient accumulation step to maintain effective batch sizes while managing memory constraints for the various models we trained.
Experiment Setup | Yes | The supervised fine-tuning (SFT) phase consists of 500 training steps with a batch size of 128 and a learning rate of 1e-5. This is followed by the reinforcement learning phase using RLOO (REINFORCE leave-one-out), where we employ a smaller batch size of 32 due to VRAM constraints and a reduced learning rate of 1e-6. During the RL phase, we generate 4 samples per problem to estimate policy gradients while maintaining reasonable computational requirements. Additionally, we set β = 0.5 and κ = 0.2. Further details are given in Appendix B. SFT: Optimizer: AdamW; Base learning rate: 1e-5; Weight decay: 0.01; Gradient clipping: 0.1 (max norm); Batch size: 128 (effective, after gradient accumulation); Training steps: 500; Warm-up steps: 50; Learning rate scheduler: Cosine annealing... RLOO Implementation: Algorithm: REINFORCE leave-one-out; Base learning rate: 1e-6; Batch size: 32; Samples per problem: 4; KL penalty coefficient: 0.1; Value loss coefficient: 0.5... Reward Model Parameters: Step penalty factor (α): 0.02; Minimum reward cutoff (β): -0.5; Solution path bonus (κ): 0.2... Inference Parameters: Temperature: 0.0; Top-p (nucleus sampling): 0.0; Maximum new tokens: 1024; Repetition penalty: No
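The RL phase described above combines RLOO (REINFORCE leave-one-out) advantage estimation with a length-penalized reward (step penalty α, minimum cutoff β, solution bonus κ). A minimal sketch of how those pieces might fit together is below; the function names and the exact reward formula are assumptions for illustration, not the authors' implementation:

```python
def shaped_reward(solved: bool, n_steps: int,
                  alpha: float = 0.02,   # step penalty factor (α)
                  beta: float = -0.5,    # minimum reward cutoff (β)
                  kappa: float = 0.2) -> float:
    """Hypothetical reward shaping: penalize long solutions, clip the
    penalized reward at beta, and add a bonus kappa for a valid solution."""
    base = 1.0 if solved else 0.0
    reward = max(base - alpha * n_steps, beta)
    if solved:
        reward += kappa
    return reward

def rloo_advantages(rewards: list[float]) -> list[float]:
    """RLOO: baseline each sample's reward with the mean reward of the
    other k-1 samples drawn for the same problem (leave-one-out)."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# 4 samples per problem, as in the paper's RL phase:
# (solved?, solution length) per sampled completion
samples = [(True, 5), (True, 12), (False, 20), (True, 7)]
rewards = [shaped_reward(s, n) for s, n in samples]
advantages = rloo_advantages(rewards)
```

By construction, the leave-one-out advantages for each problem sum to zero, so shorter successful solutions are pushed up relative to longer or failed ones without needing a learned value baseline.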