Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization
Authors: Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations show that our framework enables the model to adhere to constraints and dynamically allocate the inference budget. With different inference budgets, our best models are able to have a 4.14% and 5.74% absolute improvement (8.08% and 11.2% relative) on MATH500 using 2.16x and 4.32x inference budgets respectively, relative to LLaMA3.1 8B Instruct. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, The University of Illinois Chicago, USA 2Meta AI, Menlo Park, USA. |
| Pseudocode | Yes | Algorithm 1 Inference Budget-Constrained PO (IBPO) [...] Algorithm 2 IBPO with Sample Accumulation [...] Algorithm 3 Creating SV Response for SFT [...] Listing 1: Pythonic code snippet for solving IuB with SciPy |
| Open Source Code | No | The paper mentions using external tools like 'CPLEX (Cplex, 2009), Gurobi (Gurobi Optimization, LLC, 2024), or SciPy (Virtanen et al., 2020)', but there is no explicit statement or link indicating that the authors have open-sourced their own code for the methodology described in this paper. |
| Open Datasets | Yes | MATH refers specifically to the training split of the Hendrycks MATH dataset (Hendrycks et al., 2021b), while the 500-sample subset of the testing split is referred to as MATH500 (Lightman et al., 2023). [...] The SDPO dataset was chosen because its ground truth responses follow the SCoT format of LLaMA responses, making it convenient to run supervised fine-tuning (SFT) mixed with LLaMA samples. |
| Dataset Splits | Yes | MATH refers specifically to the training split of the Hendrycks MATH dataset (Hendrycks et al., 2021b), while the 500-sample subset of the testing split is referred to as MATH500 (Lightman et al., 2023). |
| Hardware Specification | Yes | And we conduct our experiments with NVIDIA-A100-80Gs. |
| Software Dependencies | Yes | The OPT-IuB problem is an (integer) linear programming problem that could be solved by off-the-shelf solvers, such as CPLEX (Cplex, 2009), Gurobi (Gurobi Optimization, LLC, 2024), or SciPy (Virtanen et al., 2020), which is our choice. [...] We use the SciPy MILP solver, available here, to solve an integer LP every iteration. |
| Experiment Setup | Yes | Table 8: Hyperparameters for Experiment setups 1.2, 2.1, and 2.2 (values listed as Setup 1.2 / 2.1 / 2.2): prompt size 11295 / 11295 / 10795; number of nodes 4 / 4 / 8; learning rate 1e-6 / 1e-6 / 5e-7; batch size (per node) 8 / 16 / 4; num of steps 1024 / 2048 / 240; optimizer AdamW; scheduler constant; packing yes / yes; max sequence length 32768 / 32768 / 6144; gradient accumulation 1 / 2 / 1. RL-specific params: num generation per prompt 8; max generation length 4096; temperature 1.0; top-p 0.9; KL-threshold 1024; batch accumulation kb 4; response accumulation kr 1 |
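The Software Dependencies entry notes that the paper's budget-allocation subproblem is an integer LP solved with SciPy's MILP solver at every iteration. The paper's actual formulation (its Listing 1) is not reproduced here; the sketch below only illustrates the general pattern with `scipy.optimize.milp`, using a hypothetical toy instance: pick, for each prompt, a cheap "short" or expensive "long" response to maximize reward under a cap on how many long responses the budget allows. All rewards and the budget value are made up for illustration.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Hypothetical rewards: row i gives the reward of answering prompt i with a
# short response (column 0) or a long chain-of-thought response (column 1).
reward = np.array([[0.2, 0.9],
                   [0.5, 0.6],
                   [0.1, 0.8],
                   [0.7, 0.7]])
n = reward.shape[0]

# Binary decision x[i] = 1 if prompt i gets the long response, else 0.
# Total reward = sum of short rewards (a constant) + marginal gains @ x,
# so maximizing reward means minimizing the negated marginal gains.
c = -(reward[:, 1] - reward[:, 0])

# Inference-budget constraint: at most 2 of the 4 prompts may use the
# expensive long response (the "2" is an illustrative budget).
budget = LinearConstraint(np.ones((1, n)), lb=0, ub=2)

res = milp(
    c,
    constraints=budget,
    integrality=np.ones(n),  # all decision variables are integer
    bounds=Bounds(0, 1),     # integer + [0, 1] bounds => binary
)

chosen = res.x.round().astype(int)
print(chosen)  # prompts 0 and 2 have the largest marginal gains -> [1 0 1 0]
```

With these toy numbers the solver spends the budget on the two prompts where the long response helps most (marginal gains 0.7 each). This binary selection is only a stand-in; the paper's OPT-IuB objective and constraints differ in detail.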