Weighted-Reward Preference Optimization for Implicit Model Fusion
Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. |
| Researcher Affiliation | Academia | Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan — School of Computer Science and Engineering, Sun Yat-sen University, China |
| Pseudocode | No | The paper describes the WRPO method using mathematical equations and textual descriptions, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/SLIT-AI/WRPO. |
| Open Datasets | Yes | Following prior work (Meng et al., 2024; Zhou et al., 2024), we chose UltraFeedback (Cui et al., 2024) to construct our training dataset. ... Our experiments are conducted on three widely-used instruction-following benchmarks, namely, MT-Bench (Zheng et al., 2023), AlpacaEval-2 (Li et al., 2023), and Arena-Hard (Li et al., 2024). |
| Dataset Splits | Yes | The training process is divided into two stages. In the first stage, we applied supervised fine-tuning (SFT) on the set of y_ws with one-third of the dataset... In the next stage, the remaining dataset is used for preference optimization... |
| Hardware Specification | Yes | We conducted experiments with a batch size of 128 and a maximum length of 2048 tokens on 8x80GB NVIDIA A800 GPUs. |
| Software Dependencies | No | The paper mentions several models and tools like 'ArmoRM-Llama3-8B-v0.1', 'GPT-4-0125-Preview', and the 'lm-evaluation-harness' tool, but it does not provide specific version numbers for any key software components or libraries used for implementation. |
| Experiment Setup | Yes | We conducted experiments with a batch size of 128 and a maximum length of 2048 tokens... The training was performed for a single epoch for our method. A cosine learning rate schedule with a warmup ratio of 0.1 was employed... with the learning rate empirically set to 7e-6. ... For WRPO, we used a learning rate of 3e-7 and set β = 0.01, with the weight α assigned to y_ws linearly increasing from 0 to 0.1. |
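The reported schedules (cosine learning rate with a 0.1 warmup ratio, and the fusion weight α on y_ws ramped linearly from 0 to 0.1) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names and the per-optimizer-step granularity of the α ramp are assumptions, since the paper only states that α increases linearly.

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warmup for the first `warmup_ratio` of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def fusion_weight(step, total_steps, alpha_max=0.1):
    """Weight alpha on the source-model response y_ws, increased linearly 0 -> alpha_max.

    A per-step linear ramp is assumed; the paper does not specify the granularity.
    """
    return alpha_max * step / max(1, total_steps - 1)
```

For example, with `total_steps=100` and `peak_lr=7e-6` (the SFT-stage value quoted above), the learning rate reaches its peak exactly at the end of warmup and α reaches 0.1 on the final step.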