Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

Authors: Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Eiji Uchibe

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We then perform an empirical evaluation using the Alpaca Farm and Anthropic's hh-rlhf datasets to evaluate which factors of the regularization strategies contribute to the improvement of the true reward.
Researcher Affiliation Collaboration Yuki Ichihara (Nara Institute of Science and Technology); Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto (CyberAgent); Eiji Uchibe (Advanced Telecommunications Research Institute International)
Pseudocode No The paper describes mathematical formulations and theoretical analysis of algorithms (e.g., Section 2, Section 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, figures, or sections that detail step-by-step procedures in a code-like format.
Open Source Code No Our code will be available as open source upon acceptance.
Open Datasets Yes Datasets. We conduct experiments using two datasets: the Alpaca Farm dataset (Dubois et al., 2023) and Anthropic's hh-rlhf (HH) dataset, of which we use the Harmlessness and Helpfulness subsets (Bai et al., 2022). For the Alpaca Farm dataset, we use the first 1000 entries of the train split (alpaca human preference) as the development set and the 805 entries of the evaluation split (alpaca farm evaluation) for evaluation. For Anthropic's datasets, we separately conduct experiments on the helpful-base (Helpfulness) and harmless-base (Harmlessness) subsets. For each dataset, we use the first 1000 entries of the train split as the development set and the first 1000 entries of the evaluation split for evaluation. Table 8: List of datasets and models used in the experiments. Name Reference Alpaca Farm Dubois et al. (2023) https://huggingface.co/datasets/tatsu-lab/alpaca_farm Anthropic's hh-rlhf Bai et al. (2022) https://huggingface.co/datasets/Anthropic/hh-rlhf
Dataset Splits Yes For the Alpaca Farm dataset, we use the first 1000 entries of the train split (alpaca human preference) as the development set and the 805 entries of the evaluation split (alpaca farm evaluation) for evaluation. For Anthropic's datasets, we separately conduct experiments on the helpful-base (Helpfulness) and harmless-base (Harmlessness) subsets. For each dataset, we use the first 1000 entries of the train split as the development set and the first 1000 entries of the evaluation split for evaluation.
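The split protocol quoted above (first 1000 train entries as the development set; first 1000 evaluation entries, or all 805 for Alpaca Farm, for evaluation) can be sketched as follows. This is an illustrative sketch only; the `make_splits` helper and the dummy entries are hypothetical, not from the authors' code.

```python
def make_splits(train_entries, eval_entries, n_dev=1000, n_eval=1000):
    """Apply the paper's split convention: the first n_dev entries of the
    train split form the development set, and the first n_eval entries of
    the evaluation split are used for evaluation (for Alpaca Farm the
    entire 805-entry evaluation split is used)."""
    dev = train_entries[:n_dev]
    evaluation = eval_entries[:n_eval]
    return dev, evaluation

# Dummy entries standing in for dataset rows:
train = [{"id": i} for i in range(5000)]
evals = [{"id": i} for i in range(805)]
dev, ev = make_splits(train, evals, n_eval=805)
```

Taking a fixed prefix rather than a random subsample makes the development and evaluation sets deterministic, which matters for reproducing the β selection described later in the setup.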
Hardware Specification No The paper mentions the use of various language models (e.g., Mistral 7B SFT β) and reward models, but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which these models were run or experiments were conducted.
Software Dependencies No The paper mentions specific models like 'Mistral 7B SFT β' and 'all-mpnet-base-v2' and techniques like 'nucleus sampling', but it does not list any software libraries, frameworks, or tools with their specific version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup Yes We set the maximum entry length and the maximum output length to 256 tokens. We sample response texts using nucleus sampling (Holtzman et al., 2020) with temperature set to 1.0 and top-p set to 0.9. For each entry in the Alpaca Farm dataset and Anthropic's datasets, 128 responses are generated using Mistral 7B SFT β. The hyperparameter β range is {1.0×10^-4, 2.0×10^-4, 5.0×10^-4, 1.0×10^-3, ..., 2.0×10^1}. We first find the optimal β value on the train split, then use the optimal values from the development split for the evaluation split.
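The setup above describes a 1-2-5 logarithmic sweep over β and Best-of-N selection among 128 sampled responses. A minimal sketch of both pieces follows; the function name `select_best_of_n`, the candidate dictionary fields, and the specific regularized objective (proxy reward plus β times the reference-model log-likelihood, one common regularization among those the paper compares) are assumptions for illustration, not the authors' implementation.

```python
# 1-2-5 series spanning {1.0e-4, 2.0e-4, 5.0e-4, 1.0e-3, ..., 2.0e1},
# capped at 2.0e1 as in the quoted hyperparameter range.
BETA_GRID = [m * 10.0 ** e
             for e in range(-4, 2)
             for m in (1.0, 2.0, 5.0)
             if m * 10.0 ** e <= 20.0]

def select_best_of_n(candidates, beta):
    """Pick the candidate maximizing a regularized Best-of-N objective:
    proxy reward plus beta times the reference log-probability (an
    illustrative choice; the paper evaluates several regularizers)."""
    return max(candidates, key=lambda c: c["reward"] + beta * c["ref_logprob"])

# With beta = 0 this reduces to vanilla Best-of-N (pure reward argmax);
# larger beta increasingly favors responses likely under the reference model.
cands = [
    {"reward": 1.0, "ref_logprob": -10.0},
    {"reward": 0.5, "ref_logprob": -1.0},
]
```

Selecting β on held-out development data, as the quoted setup does, guards against over-optimizing the proxy reward on the evaluation entries themselves.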