JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. |
| Researcher Affiliation | Collaboration | 1 School of EIC, Huazhong University of Science & Technology 2 Beijing Academy of Artificial Intelligence |
| Pseudocode | No | The paper describes methods such as swap augmentation, reference support, and reference drop, and illustrates the overall process in Figure 1, but does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code & Models: https://github.com/baaivision/JudgeLM |
| Open Datasets | Yes | We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges... We sample 105K instruction seed tasks from a large-scale set that contains Alpaca-GPT4 (Peng et al., 2023), Dolly-15K (Conover et al., 2023), GPT4All-LAION (Anand et al., 2023), and ShareGPT. |
| Dataset Splits | Yes | The training set contains 100K judge samples, while the validation set has 5K. |
| Hardware Specification | Yes | Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. |
| Software Dependencies | No | Table 11 lists fine-tuning settings including optimizer details (AdamW, ZeRO optimizer) and GPT-3.5 and GPT-4 versions (2023-03-15-preview), but it does not specify software library versions for components like Python, PyTorch, or TensorFlow which would be needed to replicate the experiment. |
| Experiment Setup | Yes | Table 11 provides detailed fine-tuning settings for JudgeLM, including 'model max length 2048', 'learning rate 2e-5', 'learning rate schedule cosine decay', 'optimizer AdamW', 'optimizer hyper-parameters β1, β2, ϵ = 0.9, 0.999, 1e-8', 'weight decay 0.0', 'batch size 128', 'training epochs 3', 'warmup ratio 0.003', 'numerical precision bf16, tf32', and 'gradient checkpointing True'. |
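The paper describes swap augmentation (one of the bias-mitigation methods noted in the Pseudocode row) only in prose: the order of the two candidate answers is swapped and the judgment is adjusted to match, so the fine-tuned judge does not learn a positional bias. A minimal sketch of that idea follows; the function name and sample fields (`answer_a`, `score_a`, etc.) are illustrative assumptions, not the authors' actual data schema.

```python
def swap_augment(sample: dict) -> dict:
    """Return a copy of a judge-training sample with the two
    candidate answers and their scores swapped, so the target
    judgment stays consistent with the new answer order.
    NOTE: field names are hypothetical, for illustration only."""
    swapped = dict(sample)
    swapped["answer_a"], swapped["answer_b"] = sample["answer_b"], sample["answer_a"]
    swapped["score_a"], swapped["score_b"] = sample["score_b"], sample["score_a"]
    return swapped

sample = {
    "question": "Explain overfitting.",
    "answer_a": "Answer from model A",
    "answer_b": "Answer from model B",
    "score_a": 8,
    "score_b": 6,
}
augmented = swap_augment(sample)
print(augmented["score_a"], augmented["score_b"])  # prints: 6 8
```

Training on both the original and swapped copies is what pushes the judge toward position-invariant verdicts.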
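For quick reference when reproducing the setup, the Table 11 fine-tuning settings quoted above can be collected into one place. This is a plain dictionary sketch; the key names are assumed conventions (roughly following common trainer argument names), while the values are exactly those reported in the table.

```python
# Fine-tuning settings for JudgeLM as reported in Table 11.
# Key names are illustrative assumptions; values are from the paper.
judgelm_finetune_config = {
    "model_max_length": 2048,
    "learning_rate": 2e-5,
    "lr_schedule": "cosine_decay",
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "weight_decay": 0.0,
    "batch_size": 128,
    "num_train_epochs": 3,
    "warmup_ratio": 0.003,
    "numerical_precision": ("bf16", "tf32"),
    "gradient_checkpointing": True,
}

print(judgelm_finetune_config["learning_rate"])  # prints: 2e-05
```

Note that the paper also reports using the ZeRO optimizer (in the Software Dependencies row), which a full reproduction would configure separately, e.g. via DeepSpeed.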