Self-Boosting Large Language Models with Synthetic Preference Data

Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Experimental results demonstrate that SynPO not only benefits LLM alignment with human preferences but also improves generalist capabilities across various tasks. Trained solely on synthetic data, SynPO significantly improves the instruction-following abilities of Llama3-8B and Mistral-7B (as shown in Figure 1 and Table 1), achieving over a 26% length-controlled win rate improvement on AlpacaEval 2.0 (Dubois et al., 2024) and a 22% to 30% improvement on Arena-Hard (Li et al., 2024c) (as shown in Table 2). Furthermore, self-boosted models achieve 3.2% to 5.0% higher average performance than SFT models on the Open LLM Leaderboard (Beeching et al., 2023), indicating SynPO also enhances general LLM performance.
Researcher Affiliation | Collaboration | Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei. Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Microsoft Research.
Pseudocode | Yes | A ALGORITHM: We provide the overall pipeline of SynPO in Algorithm 1. Algorithm 1: Synthetic Preference Optimization (SynPO).
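The self-boosting loop of Algorithm 1 can be sketched in plain Python. This is a hypothetical illustration, not the paper's code: every function and variable name is a placeholder, and the generator, response improver, and preference-optimization step are stubbed out.

```python
# Illustrative sketch of the SynPO self-boosting loop (Algorithm 1).
# All names are placeholders; the real system SFTs a self-prompt
# generator, refines the model's own drafts into preferred responses,
# and runs preference optimization (e.g., a SimPO-style update).

def train_prompt_generator(seed_data):
    """SFT a self-prompt generator on seed prompts (stubbed)."""
    return lambda n: [f"synthetic prompt {i}" for i in range(n)]

def improve_response(prompt, response):
    """Response improver: refines the model's draft (stubbed)."""
    return response + " [improved]"

def preference_optimize(model, pairs):
    """Preference optimization over (prompt, chosen, rejected) pairs (stubbed)."""
    return model  # a real implementation would update the policy here

def synpo_iteration(model, prompt_generator, n_prompts):
    """One SynPO round: generate prompts, build pairs, optimize."""
    prompts = prompt_generator(n_prompts)
    pairs = []
    for p in prompts:
        rejected = model(p)                     # current model's own draft
        chosen = improve_response(p, rejected)  # improver yields the preferred response
        pairs.append((p, chosen, rejected))
    return preference_optimize(model, pairs)

model = lambda p: f"draft answer to: {p}"
gen = train_prompt_generator(seed_data=["seed prompt"])
for _ in range(4):  # the paper runs multiple self-boosting iterations
    model = synpo_iteration(model, gen, n_prompts=3)
```

The key design point the sketch captures is that each round needs no new human annotation: both sides of every preference pair come from the model itself, with the improver's refinement serving as the chosen response.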
Open Source Code | No | The paper mentions specific model repositories and baseline code repositories (e.g., "https://huggingface.co/alignment-handbook/zephyr-7b-sft-full", "https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1", "https://github.com/princeton-nlp/SimPO") for models or baselines used in its experiments, but it does not provide an explicit statement or a direct link to the source code for the SynPO methodology itself.
Open Datasets | Yes | We randomly sample UltraFeedback (Cui et al., 2023) prompts and their GPT-4 Turbo completions as our seed data. The complete UltraFeedback dataset contains 61k instructions from sources including TruthfulQA (Lin et al., 2021), FalseQA (Hu et al., 2023), Evol-Instruct (Xu et al., 2023a), UltraChat (Ding et al., 2023), and ShareGPT (Chiang et al., 2023). ... AlpacaEval 2.0 (Dubois et al., 2024), Arena-Hard (Li et al., 2024c), and MT-Bench (Zheng et al., 2024). ... Open LLM Leaderboard (Beeching et al., 2023) and 6 additional benchmarks from the Language Model Evaluation Harness library (LLM Harness) (Gao et al., 2024).
Dataset Splits | No | The paper states: "we utilize 18k seed data to SFT the self-prompt generator and then generate 50k synthetic prompts per iteration" and "The seed data is multipurposely transformed for the training of self-prompt generator, response improver, and the validation of synthetic preference optimization." However, it does not explicitly provide training/validation/test splits (e.g., percentages or sample counts) for the internal datasets (seed data or synthetic preference data) used to train the SynPO models. While it uses external benchmarks such as AlpacaEval 2.0 and MT-Bench, which have their own evaluation setups, the paper does not detail the splits for its own generative and iterative training process.
Hardware Specification | No | The paper mentions using vLLM for inference but does not specify any particular GPU models (e.g., NVIDIA A100), CPU models, or other hardware components used to run the experiments.
Software Dependencies | No | The paper mentions several software components and models, such as "Mistral-Base 7B (mistralai/Mistral-7B-v0.1)", "Llama3-8B Base model (meta-llama/Meta-Llama-3-8B-Base)", the "Zephyr (Tunstall et al., 2023) training pipeline", "0.4B PairRM (Jiang et al., 2023)", "ArmoRM-Llama3-8B-v0.1", "SimPO (Meng et al., 2024)", and vLLM for inference. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or the vLLM library itself, which are necessary for full reproducibility.
Experiment Setup | Yes | The hyperparameters for self-prompt generator training are detailed below. During SFT for the self-prompt generator, we employ a learning rate of 1.0 × 10⁻⁶ for Mistral-Base and Llama3-Base, with a batch size of 32, a warm-up ratio of 0.1, and an AdamW optimizer. We set the maximum sequence length to 8,000 and train the model for 3 epochs. ... In the Mistral-Base setting, we set the PairRM scoring threshold to 0.20. In the Llama3-Base setting, the ArmoRM-Llama3-8B-v0.1 scoring threshold is set to 0.02. ... As the parameter β is crucial for achieving optimal performance in SimPO (Meng et al., 2024), we individually search β in the range [2, 4, 6, 8, 10, 12] for each optimization process. We use a fixed γ = 1.6 for both the Mistral-Base and Llama3-Base models. ... For the AlpacaEval 2 (Dubois et al., 2024) evaluation, we use a sampling-based decoding approach to generate responses. Specifically, we employ vLLM for inference, setting the temperature to 0.7 and the maximum tokens to 2048 for both the Mistral-Base and Llama3-Base configurations.
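For quick reference, the reported hyperparameters can be gathered into a single configuration sketch. The key names below are illustrative (the paper does not publish a config file); only the values come from the text quoted above.

```python
# Hypothetical config sketch; key names are placeholders, values are
# the hyperparameters reported in the paper's experiment setup.
synpo_config = {
    "prompt_generator_sft": {
        "learning_rate": 1.0e-6,   # Mistral-Base and Llama3-Base
        "batch_size": 32,
        "warmup_ratio": 0.1,
        "optimizer": "AdamW",
        "max_seq_length": 8000,
        "epochs": 3,
    },
    "reward_model_filtering": {
        "pairrm_threshold": 0.20,  # Mistral-Base setting
        "armorm_threshold": 0.02,  # Llama3-Base setting (ArmoRM-Llama3-8B-v0.1)
    },
    "simpo": {
        "beta_search_space": [2, 4, 6, 8, 10, 12],  # searched per optimization
        "gamma": 1.6,              # fixed for both base models
    },
    "alpacaeval2_decoding": {
        "temperature": 0.7,        # sampling-based decoding via vLLM
        "max_tokens": 2048,
    },
}
```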