Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Authors: Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Aviral Kumar, Rishabh Agarwal, Sridhar Thiagarajan, Craig Boutilier, Aleksandra Faust

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
Researcher Affiliation | Industry | Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, Aleksandra Faust — Google DeepMind, Google Research
Pseudocode | Yes | Reproducibility Statement. We utilize the publicly available Gemma 2B and 9B language models, the Hendrycks MATH benchmark, and the HumanEval coding benchmark, all accessible to the research community. Our experimental setup is described in detail in Section 5. Furthermore, the appendix provides comprehensive pseudo-code (Algorithms 1 to 4) and implementation details for our BoN-aware fine-tuning algorithms (BoN-SFT, BoN-RL, BoN-RLB, and BoN-RLB(P)).
Open Source Code | No | Reproducibility Statement. We utilize the publicly available Gemma 2B and 9B language models, the Hendrycks MATH benchmark, and the HumanEval coding benchmark, all accessible to the research community. Our experimental setup is described in detail in Section 5. Furthermore, the appendix provides comprehensive pseudo-code (Algorithms 1 to 4) and implementation details for our BoN-aware fine-tuning algorithms (BoN-SFT, BoN-RL, BoN-RLB, and BoN-RLB(P)).
Open Datasets | Yes | Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
Dataset Splits | Yes | For the MATH benchmark, we trained the Gemma 2B and 9B models with the Hendrycks MATH dataset. Following Lightman et al. (2023), we augment the original 7500 MATH training problems with 4500 problems from the test set, evaluating performance on the remaining 500 problems.
Hardware Specification | No | The paper mentions training Gemma 2B and 9B models but does not specify the hardware (e.g., GPU models, CPU types) used for these experiments.
Software Dependencies | No | Table 2 lists "Optimizer: AdamW" but does not provide specific version numbers for software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | Table 2 (hyperparameters used in experiments): Base model: Gemma 2B v2; Optimizer: AdamW; Learning rate (policy): 3e-6; Policy warmup steps: 100; Learning rate (value): 1e-5; Anchor EMA: 0.01; Training steps: 2500; Batch size: 32; Sampling temperature: 1.0; KL coefficient anneal steps: 2500; KL coefficient anneal range: 1.0 → 0.075; KL coefficient anneal delay: 10; Clipping values for Pfail: {0.01, 0.99}
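The Bo32 and pass@N metrics quoted in the table above can be illustrated with a minimal sketch. This is not the paper's implementation: `score` stands in for whatever verifier or reward model ranks candidates, and the numeric "responses" are placeholders.

```python
def best_of_n(candidates, score):
    """Best-of-N (BoN) sampling: from N candidate responses,
    return the one the scorer ranks highest."""
    return max(candidates, key=score)

def pass_at_n(candidates, is_correct):
    """pass@N: the problem counts as solved if ANY of the N
    candidates is correct (an oracle upper bound on BoN accuracy)."""
    return any(is_correct(c) for c in candidates)

# Toy illustration with numeric stand-ins for model responses.
cands = [0.1, 0.7, 0.3]
print(best_of_n(cands, score=lambda x: x))            # -> 0.7
print(pass_at_n(cands, is_correct=lambda x: x > 0.5)) # -> True
```

The gap between BoN accuracy (scorer-selected) and pass@N (oracle-selected) is exactly what the quoted results report separately.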
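The dataset split quoted under "Dataset Splits" (7500 original training problems, augmented with 4500 test problems, with the remaining 500 test problems held out for evaluation) can be sketched as follows; the problem IDs are hypothetical, and only the sizes mirror the described split.

```python
# Hypothetical stand-ins for the MATH problems; only the counts
# mirror the split described after Lightman et al. (2023).
math_train = [f"train-{i}" for i in range(7500)]
math_test = [f"test-{i}" for i in range(5000)]

# Augment the 7500 training problems with 4500 test problems,
# leaving 500 test problems for evaluation.
fine_tune_set = math_train + math_test[:4500]  # 12000 problems
eval_set = math_test[4500:]                    # 500 problems
```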
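The KL-coefficient schedule in Table 2 (range 1.0 → 0.075 over 2500 anneal steps, with a 10-step delay) could be realized as, for example, a linear ramp. The linear shape is an assumption on my part: the table specifies only the endpoints, step count, and delay.

```python
def kl_coefficient(step, start=1.0, end=0.075, anneal_steps=2500, delay=10):
    """KL coefficient matching Table 2's anneal range (1.0 -> 0.075),
    anneal steps (2500), and delay (10). The *linear* shape is an
    assumption; the paper's table does not state the schedule's form."""
    if step < delay:
        return start  # hold at the initial value during the delay
    frac = min((step - delay) / anneal_steps, 1.0)
    return start + frac * (end - start)
```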