Learning LLM-as-a-Judge for Preference Alignment

Authors: Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, Yiqun Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that Con-J outperforms the scalar reward model trained on the same collection of preference data, and outperforms a series of open-source and closed-source generative LLMs."
Researcher Affiliation | Collaboration | "1 Department of Computer Science and Technology, Tsinghua University; 2 Baichuan AI; 3 University of Copenhagen"
Pseudocode | Yes | "Algorithm 1: Constructing contrastive judgment pairs for Con-J"
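The cited Algorithm 1 is not reproduced in this review, but the common recipe for building contrastive judgment pairs can be sketched as follows: sample several judgments per preference example, then pair one that agrees with the human label against one that does not. A minimal sketch; the function names, dict fields, and sampling count are illustrative assumptions, not taken from the paper's algorithm.

```python
import random

def build_pairs(examples, sample_judgment, n_samples=8):
    """Construct contrastive (chosen, rejected) judgment pairs.

    examples: list of dicts with 'prompt' and ground-truth 'label'.
    sample_judgment(prompt): returns {'verdict': ..., 'text': ...}
    (e.g. from an LLM queried with the judging prompt).
    """
    pairs = []
    for ex in examples:
        judgments = [sample_judgment(ex["prompt"]) for _ in range(n_samples)]
        # Split sampled judgments by agreement with the preference label.
        correct = [j for j in judgments if j["verdict"] == ex["label"]]
        wrong = [j for j in judgments if j["verdict"] != ex["label"]]
        if correct and wrong:  # need one of each to form a contrastive pair
            pairs.append({"prompt": ex["prompt"],
                          "chosen": random.choice(correct)["text"],
                          "rejected": random.choice(wrong)["text"]})
    return pairs
```

Examples where all sampled judgments agree (or all disagree) with the label yield no pair and are skipped.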
Open Source Code | Yes | "We open-source the training process and model weights of Con-J at https://github.com/YeZiyi1998/Con-J."
Open Datasets | Yes | "In addition to the self-built datasets, we train an open-source version of Con-J on a publicly available dataset Skywork-Reward-Preference-80K-v0.1 and test its performance on public benchmarks including Infinity-Preference, UltraFeedback (Cui et al., 2023) (selecting its test set according to HuggingFaceH4), PKU-SafeRLHF (Ji et al., 2024), and RewardBench (Lambert et al., 2024)."
Dataset Splits | Yes | "We ensure that no identical prompts appear in both the training and test sets by filtering them out of the training set." The UltraFeedback (Cui et al., 2023) test set is selected according to HuggingFaceH4.
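The prompt-overlap filtering quoted above can be sketched in a few lines; a minimal sketch assuming each example is a dict with a "prompt" field (the field name and list-of-dicts layout are assumptions, not from the paper).

```python
def filter_train_overlap(train_set, test_set):
    """Drop training examples whose prompt also appears in the test set,
    so no identical prompt occurs in both splits."""
    test_prompts = {ex["prompt"] for ex in test_set}
    return [ex for ex in train_set if ex["prompt"] not in test_prompts]
```

Filtering the training set (rather than the test set) keeps the evaluation distribution fixed, which matches the quoted procedure.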
Hardware Specification | No | No specific hardware details such as GPU/CPU models, processors, or memory amounts are provided for running the experiments. The text mentions software libraries and computation precision, but not the underlying hardware.
Software Dependencies | Yes | "All the experiments in this paper are carried out based on open-source frameworks, including OpenRLHF (Hu et al., 2024), PyTorch, and Transformers."
Experiment Setup | Yes | "The maximum sequence length is set as 4,096. The batch sizes for SM and Con-J are set to 128 and 24, respectively, while their peak learning rates are set to 9e-6 and 5e-7, respectively, in accordance with existing practices for 7B models. For Con-J, we linearly combine the SFT loss and the DPO loss with α = 1e-6."
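The quoted linear combination of the SFT and DPO losses can be sketched as below. This is a minimal sketch, assuming the standard sigmoid-form DPO objective, that α weights the SFT (next-token cross-entropy) term, and an illustrative β temperature; only α = 1e-6 comes from the paper.

```python
import math

ALPHA = 1e-6  # SFT weight, from the quoted setup
BETA = 0.1    # DPO temperature; illustrative assumption, not from the paper

def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected):
    """Sigmoid-form DPO loss on summed log-probs of a judgment pair,
    relative to a frozen reference model."""
    margin = BETA * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -logsigmoid(margin)

def combined_loss(logp_c, logp_r, ref_c, ref_r, sft_nll):
    """Linear combination: L = L_DPO + alpha * L_SFT."""
    return dpo_loss(logp_c, logp_r, ref_c, ref_r) + ALPHA * sft_nll
```

With α this small, the SFT term acts mainly as a regularizer keeping the judge's generations well-formed while DPO drives the preference signal.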