Weighted-Reward Preference Optimization for Implicit Model Fusion

Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines."
Researcher Affiliation | Academia | "Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan. School of Computer Science and Engineering, Sun Yat-sen University, China."
Pseudocode | No | The paper describes the WRPO method using mathematical equations and textual descriptions, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/SLIT-AI/WRPO."
Open Datasets | Yes | "Following prior work (Meng et al., 2024; Zhou et al., 2024), we chose UltraFeedback (Cui et al., 2024) to construct our training dataset. ... Our experiments are conducted on three widely-used instruction-following benchmarks, namely, MT-Bench (Zheng et al., 2023), AlpacaEval-2 (Li et al., 2023), and Arena-Hard (Li et al., 2024)."
Dataset Splits | Yes | "The training process is divided into two stages. In the first stage, we applied supervised fine-tuning (SFT) on the set of y_ws with one-third of the dataset... In the next stage, the remaining dataset is used for preference optimization..."
Hardware Specification | Yes | "We conducted experiments with a batch size of 128 and a maximum length of 2048 tokens on 8x80GB NVIDIA A800 GPUs."
Software Dependencies | No | The paper mentions several models and tools, such as 'ArmoRM-Llama3-8B-v0.1', 'GPT-4-0125-Preview', and the 'lm-evaluation-harness' tool, but it does not provide specific version numbers for any key software components or libraries used in the implementation.
Experiment Setup | Yes | "We conducted experiments with a batch size of 128 and a maximum length of 2048 tokens... The training was performed on a single epoch for our method. A cosine learning rate schedule with a warmup ratio of 0.1 is employed... with the learning rate empirically set to 7e-6. ... For WRPO, we used a learning rate of 3e-7 and set β = 0.01, with the weight α assigned to y_ws linearly increasing from 0 to 0.1."
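The quoted setup names two schedules: a cosine learning rate schedule with a 0.1 warmup ratio, and a fusion weight α on y_ws that increases linearly from 0 to 0.1 over training. A minimal sketch of both, assuming per-step granularity and an α endpoint at the final step (the function names and step conventions are illustrative, not from the paper; only the hyperparameters 0.1 warmup ratio, 7e-6 peak LR, and the 0-to-0.1 α range come from the quoted text):

```python
import math

def fusion_weight(step, total_steps, alpha_max=0.1):
    """Weight alpha on the source-model response y_ws, linearly
    increased from 0 to alpha_max over training (per the setup)."""
    return alpha_max * step / max(total_steps - 1, 1)

def cosine_lr(step, total_steps, peak_lr=7e-6, warmup_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup over the
    first warmup_ratio fraction of steps, decaying toward zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice such schedules are usually obtained from a training library (e.g. a cosine-with-warmup scheduler), but the sketch makes the two reported warmup/decay behaviors concrete.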