Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

Authors: Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Eiji Uchibe

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We then perform an empirical evaluation using the Alpaca Farm and Anthropic's hh-rlhf datasets to evaluate which factors of the regularization strategies contribute to the improvement of the true reward.
Researcher Affiliation Collaboration Yuki Ichihara (Nara Institute of Science and Technology); Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto (CyberAgent); Eiji Uchibe (Advanced Telecommunications Research Institute International)
Pseudocode No The paper describes mathematical formulations and theoretical analysis of algorithms (e.g., Section 2, Section 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, figures, or sections that detail step-by-step procedures in a code-like format.
Open Source Code No Our code will be available as open source upon acceptance.
Open Datasets Yes Datasets. We conduct experiments using two datasets: the Alpaca Farm dataset (Dubois et al., 2023) and Anthropic's hh-rlhf (HH) dataset, of which we use the Harmlessness and Helpfulness subsets (Bai et al., 2022). For the Alpaca Farm dataset, we use the first 1000 entries of the train split (alpaca human preference) as the development set and the 805 entries of the evaluation split (alpaca farm evaluation) for evaluation. For Anthropic's datasets, we separately conduct experiments on the helpful-base (Helpfulness) and harmless-base (Harmlessness) subsets. For each dataset, we use the first 1000 entries of the train split as the development set and the first 1000 entries of the evaluation split for evaluation. Table 8: List of datasets and models used in the experiments. Name Reference Alpaca Farm Dubois et al. (2023) https://huggingface.co/datasets/tatsu-lab/alpaca_farm Anthropic's hh-rlhf Bai et al. (2022) https://huggingface.co/datasets/Anthropic/hh-rlhf
Dataset Splits Yes For the Alpaca Farm dataset, we use the first 1000 entries of the train split (alpaca human preference) as the development set and the 805 entries of the evaluation split (alpaca farm evaluation) for evaluation. For Anthropic's datasets, we separately conduct experiments on the helpful-base (Helpfulness) and harmless-base (Harmlessness) subsets. For each dataset, we use the first 1000 entries of the train split as the development set and the first 1000 entries of the evaluation split for evaluation.
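The split protocol quoted above (first 1000 train entries as the development set; first 1000 evaluation entries, or all 805 for Alpaca Farm, for evaluation) can be sketched as follows. This is an illustrative sketch only; the `make_splits` helper and the dummy entries are hypothetical, not from the authors' code.

```python
def make_splits(train_entries, eval_entries, n_dev=1000, n_eval=1000):
    """Apply the paper's split convention: the first n_dev entries of the
    train split form the development set, and the first n_eval entries of
    the evaluation split are used for evaluation (for Alpaca Farm the
    entire 805-entry evaluation split is used)."""
    dev = train_entries[:n_dev]
    evaluation = eval_entries[:n_eval]
    return dev, evaluation

# Dummy entries standing in for dataset rows:
train = [{"id": i} for i in range(5000)]
evals = [{"id": i} for i in range(805)]
dev, ev = make_splits(train, evals, n_eval=805)
```

Taking a fixed prefix rather than a random subsample makes the development and evaluation sets deterministic, which matters for reproducing the β selection described later in the setup.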
Hardware Specification No The paper mentions the use of various language models (e.g., Mistral 7B SFT β) and reward models, but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which these models were run or experiments were conducted.
Software Dependencies No The paper mentions specific models like 'Mistral 7B SFT β' and 'all-mpnet-base-v2' and techniques like 'nucleus sampling', but it does not list any software libraries, frameworks, or tools with their specific version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup Yes We set the maximum entry length and the maximum output length to 256 tokens. We sample response texts using nucleus sampling (Holtzman et al., 2020) with temperature set to 1.0 and top-p set to 0.9. For each entry in the Alpaca Farm dataset and Anthropic's datasets, 128 responses are generated using Mistral 7B SFT β. The hyperparameter β range is {1.0×10^-4, 2.0×10^-4, 5.0×10^-4, 1.0×10^-3, ..., 2.0×10^1}. We first find the optimal β value on the train split, then use the optimal values from the development split for the evaluation split.
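The setup above describes a 1-2-5 logarithmic sweep over β and Best-of-N selection among 128 sampled responses. A minimal sketch of both pieces follows; the function name `select_best_of_n`, the candidate dictionary fields, and the specific regularized objective (proxy reward plus β times the reference-model log-likelihood, one common regularization among those the paper compares) are assumptions for illustration, not the authors' implementation.

```python
# 1-2-5 series spanning {1.0e-4, 2.0e-4, 5.0e-4, 1.0e-3, ..., 2.0e1},
# capped at 2.0e1 as in the quoted hyperparameter range.
BETA_GRID = [m * 10.0 ** e
             for e in range(-4, 2)
             for m in (1.0, 2.0, 5.0)
             if m * 10.0 ** e <= 20.0]

def select_best_of_n(candidates, beta):
    """Pick the candidate maximizing a regularized Best-of-N objective:
    proxy reward plus beta times the reference log-probability (an
    illustrative choice; the paper evaluates several regularizers)."""
    return max(candidates, key=lambda c: c["reward"] + beta * c["ref_logprob"])

# With beta = 0 this reduces to vanilla Best-of-N (pure reward argmax);
# larger beta increasingly favors responses likely under the reference model.
cands = [
    {"reward": 1.0, "ref_logprob": -10.0},
    {"reward": 0.5, "ref_logprob": -1.0},
]
```

Selecting β on held-out development data, as the quoted setup does, guards against over-optimizing the proxy reward on the evaluation entries themselves.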