JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark.
Researcher Affiliation | Collaboration | 1 School of EIC, Huazhong University of Science & Technology; 2 Beijing Academy of Artificial Intelligence
Pseudocode | No | The paper describes methods such as swap augmentation, reference support, and reference drop, and illustrates the overall process in Figure 1, but does not present structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Models: https://github.com/baaivision/JudgeLM
Open Datasets | Yes | We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges... We sample 105K instruction seed tasks from a large-scale set that contains Alpaca-GPT4 (Peng et al., 2023), Dolly-15K (Conover et al., 2023), GPT4All-LAION (Anand et al., 2023), and ShareGPT.
Dataset Splits | Yes | The training set contains 100K judge samples, while the validation set has 5K.
Hardware Specification | Yes | Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs.
Software Dependencies | No | Table 11 lists fine-tuning settings, including optimizer details (AdamW, ZeRO optimizer) and GPT-3.5 and GPT-4 versions (2023-03-15-preview), but it does not specify software library versions for components such as Python, PyTorch, or TensorFlow that would be needed to replicate the experiment.
Experiment Setup | Yes | Table 11 provides detailed fine-tuning settings for JudgeLM, including 'model max length 2048', 'learning rate 2e-5', 'learning rate schedule cosine decay', 'optimizer AdamW', 'optimizer hyper-parameters β1, β2, ϵ = 0.9, 0.999, 1e-8', 'weight decay 0.0', 'batch size 128', 'training epochs 3', 'warmup ratio 0.003', 'numerical precision bf16, tf32', and 'gradient checkpointing True'.
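The Table 11 settings quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is a minimal illustration only: the key names and the `steps_per_epoch` helper are our own, not from the paper's released code.

```python
# Fine-tuning hyperparameters reported in Table 11 of the paper,
# collected in one place. Key names are illustrative assumptions;
# the paper's training scripts may organize these differently.
JUDGELM_FINETUNE_CONFIG = {
    "model_max_length": 2048,
    "learning_rate": 2e-5,
    "lr_schedule": "cosine_decay",
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "weight_decay": 0.0,
    "batch_size": 128,
    "num_train_epochs": 3,
    "warmup_ratio": 0.003,
    "precision": ("bf16", "tf32"),
    "gradient_checkpointing": True,
}

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, rounding the last partial batch up."""
    return -(-num_samples // batch_size)  # ceiling division
```

With the 100K-sample training split and batch size 128, one epoch is 782 optimizer steps, so the reported 3 epochs amount to 2,346 steps in total.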
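The swap augmentation mentioned in the Pseudocode row — fine-tuning the judge on both answer orders so its verdict cannot exploit position — can be sketched as follows. The prompt template and function name are hypothetical stand-ins, not the paper's actual judge-prompt format.

```python
def swap_augment(question: str, answer_a: str, answer_b: str):
    """Return two judge prompts with the candidate answers in both orders.

    Training on both orderings discourages position bias: the judge
    sees each pair as (A, B) and as (B, A), with the judgment labels
    swapped accordingly. The template below is an illustrative
    assumption, not the paper's real prompt.
    """
    template = (
        "Question: {q}\n"
        "Answer 1: {first}\n"
        "Answer 2: {second}\n"
        "Which answer is better?"
    )
    original = template.format(q=question, first=answer_a, second=answer_b)
    swapped = template.format(q=question, first=answer_b, second=answer_a)
    return original, swapped
```

Reference support and reference drop extend the same idea along another axis: a reference answer is appended to, or omitted from, the prompt so the judge learns to operate both with and without one.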