Evaluation of Best-of-N Sampling Strategies for Language Model Alignment
Authors: Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Eiji Uchibe
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then perform an empirical evaluation using the Alpaca Farm and Anthropic's hh-rlhf datasets to evaluate which factors of the regularization strategies contribute to the improvement of the true proxy reward. |
| Researcher Affiliation | Collaboration | Yuki Ichihara (Nara Institute of Science and Technology); Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto (CyberAgent); Eiji Uchibe (Advanced Telecommunications Research Institute International) |
| Pseudocode | No | The paper describes mathematical formulations and theoretical analysis of algorithms (e.g., Section 2, Section 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, figures, or sections that detail step-by-step procedures in a code-like format. |
| Open Source Code | No | Our code will be available as open source upon acceptance. |
| Open Datasets | Yes | Datasets. We conduct experiments using two datasets: the Alpaca Farm dataset (Dubois et al., 2023) and Anthropic's hh-rlhf (HH) dataset, of which we use the Harmlessness and Helpfulness subsets (Bai et al., 2022). For the Alpaca Farm dataset, we use the first 1000 entries of the train split (alpaca human preference) as the development set and the 805 entries of the evaluation split (alpaca farm evaluation) for evaluation. For Anthropic's datasets, we separately conduct experiments on the helpful-base (Helpfulness) and harmless-base (Harmlessness). For each dataset, we use the first 1000 entries of the train split as the development set and the first 1000 entries of the evaluation split for evaluation. Table 8: List of datasets and models used in the experiments: Alpaca Farm (Dubois et al., 2023), https://huggingface.co/datasets/tatsu-lab/alpaca_farm; Anthropic's hh-rlhf (Bai et al., 2022), https://huggingface.co/datasets/Anthropic/hh-rlhf |
| Dataset Splits | Yes | For the Alpaca Farm dataset, we use the first 1000 entries of the train split (alpaca human preference) as the development set and the 805 entries of the evaluation split (alpaca farm evaluation) for evaluation. For Anthropic's datasets, we separately conduct experiments on the helpful-base (Helpfulness) and harmless-base (Harmlessness). For each dataset, we use the first 1000 entries of the train split as the development set and the first 1000 entries of the evaluation split for evaluation. |
| Hardware Specification | No | The paper mentions the use of various language models (e.g., Mistral 7B SFT β) and reward models, but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which these models were run or experiments were conducted. |
| Software Dependencies | No | The paper mentions specific models like 'Mistral 7B SFT β' and 'all-mpnet-base-v2' and techniques like 'nucleus sampling', but it does not list any software libraries, frameworks, or tools with their specific version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | We set the maximum entry length and the maximum output length to be 256 tokens. We sample response texts using nucleus sampling (Holtzman et al., 2020) with temperature set to 1.0 and top-p set to 0.9. For each entry in the Alpaca Farm dataset and Anthropic's datasets, 128 responses are generated using Mistral 7B SFT β. The hyperparameter β is searched over {1.0 × 10^-4, 2.0 × 10^-4, 5.0 × 10^-4, 1.0 × 10^-3, ..., 2.0 × 10^1}. We first find the optimal value of β on the development split, then use that value for the evaluation split. |
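The Best-of-N procedure evaluated in the setup above (sample N candidates per prompt, score each with a proxy reward model, keep the highest-scoring one) can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_sampler` and `toy_reward` are hypothetical stand-ins for the actual generator (Mistral 7B SFT β with nucleus sampling, temperature 1.0, top-p 0.9) and the trained proxy reward model.

```python
import random

def best_of_n(prompt, sample_fn, reward_fn, n=128):
    """Return the candidate with the highest proxy reward among n samples."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    scores = [reward_fn(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]

# Hypothetical stand-ins; the reported setup instead draws 128 nucleus
# samples from the language model and scores them with a reward model.
def toy_sampler(prompt):
    return prompt + " " + random.choice(["a", "bb", "ccc"])

def toy_reward(prompt, response):
    return len(response)  # stand-in score: prefers longer responses

random.seed(0)
best, score = best_of_n("hi", toy_sampler, toy_reward, n=8)
```

The regularized variants the paper studies would modify only `reward_fn`, e.g. by subtracting a penalty term (scaled by β) from the proxy reward before the `max` selection.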