Learning LLM-as-a-Judge for Preference Alignment
Authors: Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, Yiqun Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Con-J outperforms the scalar reward model trained on the same collection of preference data, and outperforms a series of open-source and closed-source generative LLMs. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science and Technology, Tsinghua University 2Baichuan AI 3University of Copenhagen |
| Pseudocode | Yes | Algorithm 1 Constructing contrastive judgment pairs for Con-J |
| Open Source Code | Yes | We open-source the training process and model weights of Con-J at https://github.com/YeZiyi1998/Con-J. |
| Open Datasets | Yes | In addition to the self-built datasets, we train an open-source version of Con-J on a publicly available dataset, Skywork-Reward-Preference-80K-v0.1, and test its performance on public benchmarks including Infinity-Preference, UltraFeedback (Cui et al., 2023) (selecting its test set according to Hugging Face H4), PKU-SafeRLHF (Ji et al., 2024), and RewardBench (Lambert et al., 2024). |
| Dataset Splits | Yes | We ensure that no identical prompts appear in both the training and test sets by filtering them out of the training set. ... UltraFeedback (Cui et al., 2023) (selecting its test set according to Hugging Face H4) |
| Hardware Specification | No | No specific hardware details such as GPU/CPU models, processors, or memory amounts are provided for running the experiments. The text mentions software libraries and computation precision but not the underlying hardware. |
| Software Dependencies | Yes | All the experiments in this paper are carried out based on open-source frameworks, including OpenRLHF (Hu et al., 2024), PyTorch, and Transformers. |
| Experiment Setup | Yes | The maximum sequence length is set as 4,096. The batch sizes for SM and Con-J are set to 128 and 24, respectively, while their peak learning rates are set to 9e-6 and 5e-7, respectively, in accordance with existing practices for 7B models. For Con-J, we linearly combine the SFT loss and the DPO loss with α = 1e-6. |
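The linear combination of the SFT and DPO losses with α = 1e-6 reported in the setup row can be sketched as below. This is a minimal illustration using the standard DPO objective, not the paper's implementation: the exact combination form `dpo + alpha * sft`, the β value, and the function names are assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one pair of contrastive judgments.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred judgment; ref_logp_*: the same under the frozen
    reference model. beta is the DPO temperature (assumed value).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written out explicitly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def con_j_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
               alpha=1e-6, beta=0.1):
    """Linear combination of the DPO loss and an SFT term with
    weight alpha (1e-6 in the paper). The SFT term is assumed to be
    the negative log-likelihood of the preferred judgment."""
    sft = -logp_w
    dpo = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return dpo + alpha * sft
```

With α this small, the SFT term acts as a mild regularizer toward the preferred judgment text while the DPO term dominates the gradient.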