Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Authors: Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Aviral Kumar, Rishabh Agarwal, Sridhar Thiagarajan, Craig Boutilier, Aleksandra Faust

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
Researcher Affiliation | Industry | Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, Aleksandra Faust — Google DeepMind, Google Research
Pseudocode | Yes | Reproducibility Statement. We utilize the publicly available Gemma 2B and 9B language models, the Hendrycks MATH benchmark, and the HumanEval coding benchmark, all accessible to the research community. Our experimental setup is described in detail in Section 5. Furthermore, the appendix provides comprehensive pseudo-code (Algorithms 1 to 4) and implementation details for our BoN-aware fine-tuning algorithms (BoN-SFT, BoN-RL, BoN-RLB, and BoN-RLB(P)).
Open Source Code | No | Reproducibility Statement. We utilize the publicly available Gemma 2B and 9B language models, the Hendrycks MATH benchmark, and the HumanEval coding benchmark, all accessible to the research community. Our experimental setup is described in detail in Section 5. Furthermore, the appendix provides comprehensive pseudo-code (Algorithms 1 to 4) and implementation details for our BoN-aware fine-tuning algorithms (BoN-SFT, BoN-RL, BoN-RLB, and BoN-RLB(P)).
Open Datasets | Yes | Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
Dataset Splits | Yes | For the MATH benchmark, we trained the Gemma 2B and 9B models with the Hendrycks MATH dataset. Following Lightman et al. (2023), we augment the original 7500 MATH training problems with 4500 problems from the test set, evaluating performance on the remaining 500 problems.
Hardware Specification | No | The paper mentions training Gemma 2B and 9B models but does not specify the hardware (e.g., GPU models, CPU types) used for these experiments.
Software Dependencies | No | Table 2 lists "Optimizer: AdamW" but does not provide specific version numbers for software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | Table 2 (hyperparameters used in experiments): Base model: Gemma 2B v2; Optimizer: AdamW; Learning rate (policy): 3e-6; Policy warmup steps: 100; Learning rate (value): 1e-5; Anchor EMA: 0.01; Training steps: 2500; Batch size: 32; Sampling temperature: 1.0; KL coefficient anneal steps: 2500; KL coefficient anneal range: 1.0 → 0.075; KL coefficient anneal delay: 10; Clipping values for Pfail: {0.01, 0.99}
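The Bo32 and pass@N metrics quoted in the table above can be illustrated with a minimal sketch. This is not the paper's implementation: `score` stands in for whatever verifier or reward model ranks candidates, and the numeric "responses" are placeholders.

```python
def best_of_n(candidates, score):
    """Best-of-N (BoN) sampling: from N candidate responses,
    return the one the scorer ranks highest."""
    return max(candidates, key=score)

def pass_at_n(candidates, is_correct):
    """pass@N: the problem counts as solved if ANY of the N
    candidates is correct (an oracle upper bound on BoN accuracy)."""
    return any(is_correct(c) for c in candidates)

# Toy illustration with numeric stand-ins for model responses.
cands = [0.1, 0.7, 0.3]
print(best_of_n(cands, score=lambda x: x))            # -> 0.7
print(pass_at_n(cands, is_correct=lambda x: x > 0.5)) # -> True
```

The gap between BoN accuracy (scorer-selected) and pass@N (oracle-selected) is exactly what the quoted results report separately.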
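The dataset split quoted under "Dataset Splits" (7500 original training problems, augmented with 4500 test problems, with the remaining 500 test problems held out for evaluation) can be sketched as follows; the problem IDs are hypothetical, and only the sizes mirror the described split.

```python
# Hypothetical stand-ins for the MATH problems; only the counts
# mirror the split described after Lightman et al. (2023).
math_train = [f"train-{i}" for i in range(7500)]
math_test = [f"test-{i}" for i in range(5000)]

# Augment the 7500 training problems with 4500 test problems,
# leaving 500 test problems for evaluation.
fine_tune_set = math_train + math_test[:4500]  # 12000 problems
eval_set = math_test[4500:]                    # 500 problems
```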
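The KL-coefficient schedule in Table 2 (range 1.0 → 0.075 over 2500 anneal steps, with a 10-step delay) could be realized as, for example, a linear ramp. The linear shape is an assumption on my part: the table specifies only the endpoints, step count, and delay.

```python
def kl_coefficient(step, start=1.0, end=0.075, anneal_steps=2500, delay=10):
    """KL coefficient matching Table 2's anneal range (1.0 -> 0.075),
    anneal steps (2500), and delay (10). The *linear* shape is an
    assumption; the paper's table does not state the schedule's form."""
    if step < delay:
        return start  # hold at the initial value during the delay
    frac = min((step - delay) / anneal_steps, 1.0)
    return start + frac * (end - start)
```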