SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters

Authors: Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant Honavar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches even without any hyperparameters or a reference model.
Researcher Affiliation | Collaboration | Pennsylvania State University; University of Chinese Academy of Sciences; Meituan Inc.; Tencent AI Lab; Sun Yat-sen University; Leiden University
Pseudocode | No | The paper describes its methods mathematically and textually in Section 3 but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code for SimPER is publicly available on GitHub: https://github.com/tengxiao1/SimPER.
Open Datasets | Yes | For the Llama3-8B-Base and Mistral-7B-Base setups, we follow the same training pipeline as Zephyr (Tunstall et al., 2023) and evaluate SimPER on the widely used benchmark dataset for preference fine-tuning: the UltraFeedback Binarized dataset (Cui et al., 2023; Tunstall et al., 2023). For Pythia-2.8B, the Anthropic-HH dataset (Bai et al., 2022) is used for dialogue generation to produce helpful and harmless responses (Rafailov et al., 2024).
Dataset Splits | Yes | For the Llama3-8B-Base and Mistral-7B-Base setups, we follow the same training pipeline as Zephyr (Tunstall et al., 2023)... For the Llama3-8B-Instruct and Mistral-7B-Instruct setups, we evaluate using the on-policy datasets generated by SimPO (Meng et al., 2024)... For Pythia-2.8B, the Anthropic-HH dataset (Bai et al., 2022) is used... For evaluation on Anthropic-HH, we use GPT-4 for zero-shot pairwise evaluation, consistent with human judgments (see prompts in Appendix B.1). The Anthropic-HH test set is used as the evaluation dataset.
Hardware Specification | Yes | All training experiments in this paper were conducted on 4 NVIDIA A100 (80GB) GPUs with a batch size of 128, based on the alignment-handbook repo.
Software Dependencies | No | The paper mentions the use of an "Adam optimizer (Kingma, 2014)" but does not specify versions for other key software dependencies, such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For general hyperparameters, we adhered strictly to the settings used in SimPO. For the SFT stage, we use a learning rate of 2e-5. For both the SFT and the preference optimization stages, we use a batch size of 128, a maximum sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for one epoch, all with the Adam optimizer (Kingma, 2014). We maintain these settings consistently to ensure uniformity and comparability across experiments. For method-specific hyperparameters... the search strategy is in Table 6. Each method is individually searched over learning rates in [3e-7, 5e-7, 6e-7, 1e-6].
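The shared training recipe reported above (SFT learning rate, batch size, sequence length, scheduler, warmup, and the per-method learning-rate search grid) can be collected into a small config sketch. This is a hedged illustration only: the field names and the `AlignmentConfig` dataclass are assumptions for readability, not the authors' actual configuration files from the SimPER or alignment-handbook codebases.

```python
from dataclasses import dataclass, field
from typing import Tuple

@dataclass
class AlignmentConfig:
    """Sketch of the shared training setup reported in the paper.

    Field names are illustrative; they do not come from the SimPER repo.
    """
    sft_learning_rate: float = 2e-5                 # SFT stage learning rate
    pref_lr_search_grid: Tuple[float, ...] = (      # per-method search grid for
        3e-7, 5e-7, 6e-7, 1e-6)                     # the preference stage
    batch_size: int = 128                           # both SFT and preference stages
    max_seq_length: int = 2048
    lr_scheduler: str = "cosine"
    warmup_ratio: float = 0.1                       # 10% warmup steps
    num_epochs: int = 1
    optimizer: str = "adam"                         # Adam (Kingma, 2014)

cfg = AlignmentConfig()
```

One point the sketch makes explicit: only the preference-stage learning rate is searched per method; everything else is held fixed across experiments for comparability.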