Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Authors: Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-Bench. Our results suggest that using the weak model to elicit a strong model with high alignment ability is feasible."
Researcher Affiliation: Academia. "Wenhong Zhu (1,2), Zhiwei He (1), Xiaofeng Wang (1), Pengfei Liu (1,2), Rui Wang (1); (1) Shanghai Jiao Tong University, (2) Shanghai Innovation Institute."
Pseudocode: No. "The general WSPO pipeline operates as follows: (1) Utilize offline datasets D = {(x^(i), y_w^(i))}_{i=1}^N, such as the selected preference or SFT datasets; paired datasets are not required. In Appendix C, we demonstrate that even the rejected preference dataset remains effective for the WSPO algorithm. (2) Prepare the weak model, both pre- and post-alignment. (3) Optimize the LM π_θ^strong(y | x) to minimize the objective L_WSPO on the specified dataset. The only parameter requiring tuning is γ."
Open Source Code: Yes. "The code is available at https://github.com/zwhong714/weak-to-strong-preference-optimization."
Open Datasets: Yes. "We employ the Qwen2-1.5B-base and Qwen2-7B-base models (Yang et al., 2024a) as our pre-trained weak model, π_base^weak, and strong model, π_base^strong, respectively. To train the SFT models, π_sft^weak and π_sft^strong, we utilize the training split of the XSUM dataset (Narayan et al., 2018)."
Dataset Splits: Yes. "We utilize the training split of the XSUM dataset (Narayan et al., 2018). Subsequently, the validation split is employed for further fine-tuning, leading to the development of the corresponding PPO-aligned models, π_ppo^weak and π_ppo^strong. When training with WSPO, we directly use the distributional differences between π_base^weak and π_ppo^weak to align π_base^strong and derive π_wspo^strong, because no additional knowledge is required to output a summary in the summarization task. For evaluation, the parameters L_min and L_max are set to 20 and 30, respectively. We use the test split for evaluation to guarantee no data contamination (Zhu et al., 2023). Detailed experimental settings are provided in Appendix B.1."
Hardware Specification: Yes. "All the training experiments in this paper were conducted on 8 H100 GPUs based on the LLaMA-Factory (Zheng et al., 2024b) repo, which provides an integrated approach to fine-tuning over 100 LLMs with a diverse range of efficient fine-tuning techniques. If not specified, the inference engine used by our LMs defaults to vLLM (Kwon et al., 2023)."
Software Dependencies: No. "All the training experiments in this paper were conducted on 8 H100 GPUs based on the LLaMA-Factory (Zheng et al., 2024b) repo, which provides an integrated approach to fine-tuning over 100 LLMs with a diverse range of efficient fine-tuning techniques. If not specified, the inference engine used by our LMs defaults to vLLM (Kwon et al., 2023)."
Experiment Setup: Yes. "We first fine-tune the base model on the dataset for three epochs with a batch size of 32, yielding our SFT model. Then, we fine-tune the SFT models on the XSUM validation set of approximately 10,000 items. We train aligned policy models using PPO to maximize the length reward in Equation 8, with a batch size of 8, for about ten epochs."
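The SFT setup above (full fine-tuning via LLaMA-Factory, batch size 32 on 8 GPUs, three epochs) could be written as a LLaMA-Factory-style recipe. This is a hedged sketch, not the authors' actual config: the dataset alias `xsum_train` and the per-device batch split are assumptions, while the key names follow the repo's example configs.

```yaml
# Sketch of an SFT recipe in LLaMA-Factory's YAML format.
# `xsum_train` is a hypothetical dataset alias that would need
# registering in dataset_info.json.
model_name_or_path: Qwen/Qwen2-7B
stage: sft
do_train: true
finetuning_type: full
dataset: xsum_train
template: qwen
per_device_train_batch_size: 4     # 8 GPUs x 4 = effective batch size 32
gradient_accumulation_steps: 1
num_train_epochs: 3.0
output_dir: saves/qwen2-7b-sft
```

A recipe like this would be launched with `llamafactory-cli train <config>.yaml`.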
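The PPO stage maximizes the length reward of Equation 8 with L_min = 20 and L_max = 30. Equation 8 itself is not quoted in this table, so the band-shaped reward below is only a hypothetical stand-in illustrating the stated bounds, not the paper's exact formula:

```python
def length_reward(summary_len, l_min=20, l_max=30):
    """Hypothetical stand-in for the paper's Equation 8: reward summaries
    whose length falls inside [l_min, l_max], and penalize deviation
    linearly outside the band. The exact functional form is in the paper."""
    if summary_len < l_min:
        return float(-(l_min - summary_len))  # too short: negative reward
    if summary_len > l_max:
        return float(-(summary_len - l_max))  # too long: negative reward
    return 1.0                                # inside the target band
```

Under PPO, a reward of this shape pushes the summarization policy toward outputs of 20 to 30 tokens.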
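The pipeline row states that WSPO aligns the strong model using the distributional difference between the weak model before and after alignment, with γ as the only tuned parameter. The exact L_WSPO is defined in the paper; the sketch below is one plausible instantiation (an assumption, not the verbatim objective) that matches the strong model's log-ratio against the weak model's scaled alignment shift:

```python
def weak_alignment_shift(logp_weak_aligned, logp_weak_base, gamma=1.0):
    """Reward signal "stolen" from the weak pair: the log-ratio between the
    aligned and base weak models on a response, scaled by the single
    tunable parameter gamma."""
    return gamma * (logp_weak_aligned - logp_weak_base)


def wspo_loss(logp_strong, logp_strong_ref,
              logp_weak_aligned, logp_weak_base, gamma=1.0):
    """Hypothetical squared-gap objective: push the strong model's log-ratio
    (against its SFT reference) toward the weak model's scaled alignment
    shift. A sketch of the idea, not the paper's exact L_WSPO."""
    strong_shift = logp_strong - logp_strong_ref
    target = weak_alignment_shift(logp_weak_aligned, logp_weak_base, gamma)
    return (strong_shift - target) ** 2
```

In this reading, the loss is zero when the strong model reproduces (γ times) the weak model's alignment shift, which is consistent with γ being the only hyperparameter the pipeline tunes.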