Weighted-Reward Preference Optimization for Implicit Model Fusion

Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines."
Researcher Affiliation | Academia | "Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan. School of Computer Science and Engineering, Sun Yat-sen University, China."
Pseudocode | No | The paper describes the WRPO method using mathematical equations and textual descriptions, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/SLIT-AI/WRPO."
Open Datasets | Yes | "Following prior work (Meng et al., 2024; Zhou et al., 2024), we chose UltraFeedback (Cui et al., 2024) to construct our training dataset. ... Our experiments are conducted on three widely-used instruction-following benchmarks, namely, MT-Bench (Zheng et al., 2023), AlpacaEval-2 (Li et al., 2023), and Arena-Hard (Li et al., 2024)."
Dataset Splits | Yes | "The training process is divided into two stages. In the first stage, we applied supervised fine-tuning (SFT) on the set of y_ws with one-third of the dataset... In the next stage, the remaining dataset is used for preference optimization..."
Hardware Specification | Yes | "We conducted experiments with a batch size of 128 and a maximum length of 2048 tokens on 8x80GB NVIDIA A800 GPUs."
Software Dependencies | No | The paper mentions several models and tools, such as 'ArmoRM-Llama3-8B-v0.1', 'GPT-4-0125-Preview', and the 'lm-evaluation-harness' tool, but it does not provide specific version numbers for any key software components or libraries used in the implementation.
Experiment Setup | Yes | "We conducted experiments with a batch size of 128 and a maximum length of 2048 tokens... The training was performed on a single epoch for our method. A cosine learning rate schedule with a warmup ratio of 0.1 is employed... with the learning rate empirically set to 7e-6. ... For WRPO, we used a learning rate of 3e-7 and set β = 0.01, with the weight α assigned to y_ws linearly increasing from 0 to 0.1."
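The quoted setup names two schedules: a cosine learning rate schedule with a 0.1 warmup ratio, and a fusion weight α on y_ws that increases linearly from 0 to 0.1 over training. A minimal sketch of both, assuming per-step granularity and an α endpoint at the final step (the function names and step conventions are illustrative, not from the paper; only the hyperparameters 0.1 warmup ratio, 7e-6 peak LR, and the 0-to-0.1 α range come from the quoted text):

```python
import math

def fusion_weight(step, total_steps, alpha_max=0.1):
    """Weight alpha on the source-model response y_ws, linearly
    increased from 0 to alpha_max over training (per the setup)."""
    return alpha_max * step / max(total_steps - 1, 1)

def cosine_lr(step, total_steps, peak_lr=7e-6, warmup_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup over the
    first warmup_ratio fraction of steps, decaying toward zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice such schedules are usually obtained from a training library (e.g. a cosine-with-warmup scheduler), but the sketch makes the two reported warmup/decay behaviors concrete.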