Learning LLM-as-a-Judge for Preference Alignment

Authors: Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, Yiqun Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that Con-J outperforms the scalar reward model trained on the same collection of preference data, and outperforms a series of open-source and closed-source generative LLMs."
Researcher Affiliation | Collaboration | "1 Department of Computer Science and Technology, Tsinghua University; 2 Baichuan AI; 3 University of Copenhagen"
Pseudocode | Yes | "Algorithm 1: Constructing contrastive judgment pairs for Con-J"
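The cited Algorithm 1 is not reproduced in this review, but the common recipe for building contrastive judgment pairs can be sketched as follows: sample several judgments per preference example, then pair one that agrees with the human label against one that does not. A minimal sketch; the function names, dict fields, and sampling count are illustrative assumptions, not taken from the paper's algorithm.

```python
import random

def build_pairs(examples, sample_judgment, n_samples=8):
    """Construct contrastive (chosen, rejected) judgment pairs.

    examples: list of dicts with 'prompt' and ground-truth 'label'.
    sample_judgment(prompt): returns {'verdict': ..., 'text': ...}
    (e.g. from an LLM queried with the judging prompt).
    """
    pairs = []
    for ex in examples:
        judgments = [sample_judgment(ex["prompt"]) for _ in range(n_samples)]
        # Split sampled judgments by agreement with the preference label.
        correct = [j for j in judgments if j["verdict"] == ex["label"]]
        wrong = [j for j in judgments if j["verdict"] != ex["label"]]
        if correct and wrong:  # need one of each to form a contrastive pair
            pairs.append({"prompt": ex["prompt"],
                          "chosen": random.choice(correct)["text"],
                          "rejected": random.choice(wrong)["text"]})
    return pairs
```

Examples where all sampled judgments agree (or all disagree) with the label yield no pair and are skipped.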
Open Source Code | Yes | "We open-source the training process and model weights of Con-J at https://github.com/YeZiyi1998/Con-J."
Open Datasets | Yes | "In addition to the self-built datasets, we train an open-source version of Con-J on a publicly available dataset Skywork-Reward-Preference-80K-v0.1 and test its performance on public benchmarks including Infinity-Preference, UltraFeedback (Cui et al., 2023) (selecting its test set according to HuggingFaceH4), PKU-SafeRLHF (Ji et al., 2024), and RewardBench (Lambert et al., 2024)."
Dataset Splits | Yes | "We ensure that no identical prompts appear in both the training and test sets by filtering them out of the training set." The UltraFeedback (Cui et al., 2023) test set is selected according to HuggingFaceH4.
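The prompt-overlap filtering quoted above can be sketched in a few lines; a minimal sketch assuming each example is a dict with a "prompt" field (the field name and list-of-dicts layout are assumptions, not from the paper).

```python
def filter_train_overlap(train_set, test_set):
    """Drop training examples whose prompt also appears in the test set,
    so no identical prompt occurs in both splits."""
    test_prompts = {ex["prompt"] for ex in test_set}
    return [ex for ex in train_set if ex["prompt"] not in test_prompts]
```

Filtering the training set (rather than the test set) keeps the evaluation distribution fixed, which matches the quoted procedure.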
Hardware Specification | No | No specific hardware details such as GPU/CPU models, processors, or memory amounts are provided for running the experiments. The text mentions software libraries and computation precision, but not the underlying hardware.
Software Dependencies | Yes | "All the experiments in this paper are carried out based on open-source frameworks, including OpenRLHF (Hu et al., 2024), PyTorch, and Transformers."
Experiment Setup | Yes | "The maximum sequence length is set as 4,096. The batch sizes for SM and Con-J are set to 128 and 24, respectively, while their peak learning rates are set to 9e-6 and 5e-7, respectively, in accordance with existing practices for 7B models. For Con-J, we linearly combine the SFT loss and the DPO loss with α = 1e-6."
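The quoted linear combination of the SFT and DPO losses can be sketched as below. This is a minimal sketch, assuming the standard sigmoid-form DPO objective, that α weights the SFT (next-token cross-entropy) term, and an illustrative β temperature; only α = 1e-6 comes from the paper.

```python
import math

ALPHA = 1e-6  # SFT weight, from the quoted setup
BETA = 0.1    # DPO temperature; illustrative assumption, not from the paper

def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected):
    """Sigmoid-form DPO loss on summed log-probs of a judgment pair,
    relative to a frozen reference model."""
    margin = BETA * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -logsigmoid(margin)

def combined_loss(logp_c, logp_r, ref_c, ref_r, sft_nll):
    """Linear combination: L = L_DPO + alpha * L_SFT."""
    return dpo_loss(logp_c, logp_r, ref_c, ref_r) + ALPHA * sft_nll
```

With α this small, the SFT term acts mainly as a regularizer keeping the judge's generations well-formed while DPO drives the preference signal.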