JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark.
Researcher Affiliation | Collaboration | 1 School of EIC, Huazhong University of Science & Technology; 2 Beijing Academy of Artificial Intelligence
Pseudocode | No | The paper describes methods such as swap augmentation, reference support, and reference drop, and illustrates the overall process in Figure 1, but does not present structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Models: https://github.com/baaivision/JudgeLM
Open Datasets | Yes | We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges... We sample 105K instruction seed tasks from a large-scale set that contains Alpaca-GPT4 (Peng et al., 2023), Dolly-15K (Conover et al., 2023), GPT4All-LAION (Anand et al., 2023), and ShareGPT.
Dataset Splits | Yes | The training set contains 100K judge samples, while the validation set has 5K.
Hardware Specification | Yes | Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs.
Software Dependencies | No | Table 11 lists fine-tuning settings, including optimizer details (AdamW, ZeRO optimizer) and GPT-3.5 and GPT-4 versions (2023-03-15-preview), but it does not specify software library versions for components such as Python, PyTorch, or TensorFlow that would be needed to replicate the experiment.
Experiment Setup | Yes | Table 11 provides detailed fine-tuning settings for JudgeLM, including 'model max length 2048', 'learning rate 2e-5', 'learning rate schedule cosine decay', 'optimizer AdamW', 'optimizer hyper-parameters β1, β2, ϵ = 0.9, 0.999, 1e-8', 'weight decay 0.0', 'batch size 128', 'training epochs 3', 'warmup ratio 0.003', 'numerical precision bf16, tf32', and 'gradient checkpointing True'.
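The Table 11 settings quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is a minimal illustration only: the key names and the `steps_per_epoch` helper are our own, not from the paper's released code.

```python
# Fine-tuning hyperparameters reported in Table 11 of the paper,
# collected in one place. Key names are illustrative assumptions;
# the paper's training scripts may organize these differently.
JUDGELM_FINETUNE_CONFIG = {
    "model_max_length": 2048,
    "learning_rate": 2e-5,
    "lr_schedule": "cosine_decay",
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "weight_decay": 0.0,
    "batch_size": 128,
    "num_train_epochs": 3,
    "warmup_ratio": 0.003,
    "precision": ("bf16", "tf32"),
    "gradient_checkpointing": True,
}

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, rounding the last partial batch up."""
    return -(-num_samples // batch_size)  # ceiling division
```

With the 100K-sample training split and batch size 128, one epoch is 782 optimizer steps, so the reported 3 epochs amount to 2,346 steps in total.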
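The swap augmentation mentioned in the Pseudocode row — fine-tuning the judge on both answer orders so its verdict cannot exploit position — can be sketched as follows. The prompt template and function name are hypothetical stand-ins, not the paper's actual judge-prompt format.

```python
def swap_augment(question: str, answer_a: str, answer_b: str):
    """Return two judge prompts with the candidate answers in both orders.

    Training on both orderings discourages position bias: the judge
    sees each pair as (A, B) and as (B, A), with the judgment labels
    swapped accordingly. The template below is an illustrative
    assumption, not the paper's real prompt.
    """
    template = (
        "Question: {q}\n"
        "Answer 1: {first}\n"
        "Answer 2: {second}\n"
        "Which answer is better?"
    )
    original = template.format(q=question, first=answer_a, second=answer_b)
    swapped = template.format(q=question, first=answer_b, second=answer_a)
    return original, swapped
```

Reference support and reference drop extend the same idea along another axis: a reference answer is appended to, or omitted from, the prompt so the judge learns to operate both with and without one.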