SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters

Authors: Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant Honavar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches even without any hyperparameters or a reference model.
Researcher Affiliation | Collaboration | Pennsylvania State University; University of Chinese Academy of Sciences; Meituan Inc.; Tencent AI Lab; Sun Yat-sen University; Leiden University
Pseudocode | No | The paper describes its methods mathematically and textually in Section 3 but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code for SimPER is publicly available on GitHub: https://github.com/tengxiao1/SimPER.
Open Datasets | Yes | For the Llama3-8B-Base and Mistral-7B-Base setups, we follow the same training pipeline as Zephyr (Tunstall et al., 2023) and evaluate SimPER on the widely used benchmark dataset for preference fine-tuning: the UltraFeedback Binarized dataset (Cui et al., 2023; Tunstall et al., 2023). For Pythia-2.8B, the Anthropic-HH dataset (Bai et al., 2022) is used for dialogue generation to produce helpful and harmless responses (Rafailov et al., 2024).
Dataset Splits | Yes | For the Llama3-8B-Base and Mistral-7B-Base setups, we follow the same training pipeline as Zephyr (Tunstall et al., 2023)... For the Llama3-8B-Instruct and Mistral-7B-Instruct setups, we evaluate using the on-policy datasets generated by SimPO (Meng et al., 2024)... For Pythia-2.8B, the Anthropic-HH dataset (Bai et al., 2022) is used... For evaluation on Anthropic-HH, we use GPT-4 for zero-shot pairwise evaluation, consistent with human judgments (see prompts in Appendix B.1). The Anthropic-HH test set is used as the evaluation dataset.
Hardware Specification | Yes | All training experiments in this paper were conducted on 4 NVIDIA A100 (80GB) GPUs with a batch size of 128, based on the alignment-handbook repo.
Software Dependencies | No | The paper mentions the use of an "Adam optimizer (Kingma, 2014)" but does not specify versions for other key software dependencies, such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For general hyperparameters, we adhered strictly to the settings used in SimPO. For the SFT stage, we use a learning rate of 2e-5. For both the SFT and the preference optimization stages, we use a batch size of 128, a maximum sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for one epoch, all with the Adam optimizer (Kingma, 2014). We maintain these settings consistently to ensure uniformity and comparability across experiments. For method-specific hyperparameters... the search strategy is in Table 6. Each method is individually searched over learning rates in [3e-7, 5e-7, 6e-7, 1e-6].
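The shared training recipe reported above (SFT learning rate, batch size, sequence length, scheduler, warmup, and the per-method learning-rate search grid) can be collected into a small config sketch. This is a hedged illustration only: the field names and the `AlignmentConfig` dataclass are assumptions for readability, not the authors' actual configuration files from the SimPER or alignment-handbook codebases.

```python
from dataclasses import dataclass, field
from typing import Tuple

@dataclass
class AlignmentConfig:
    """Sketch of the shared training setup reported in the paper.

    Field names are illustrative; they do not come from the SimPER repo.
    """
    sft_learning_rate: float = 2e-5                 # SFT stage learning rate
    pref_lr_search_grid: Tuple[float, ...] = (      # per-method search grid for
        3e-7, 5e-7, 6e-7, 1e-6)                     # the preference stage
    batch_size: int = 128                           # both SFT and preference stages
    max_seq_length: int = 2048
    lr_scheduler: str = "cosine"
    warmup_ratio: float = 0.1                       # 10% warmup steps
    num_epochs: int = 1
    optimizer: str = "adam"                         # Adam (Kingma, 2014)

cfg = AlignmentConfig()
```

One point the sketch makes explicit: only the preference-stage learning rate is searched per method; everything else is held fixed across experiments for comparability.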