JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Authors: Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, Ion Stoica
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. |
| Researcher Affiliation | Academia | 1UC Berkeley, 2Washington University in St. Louis |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly presented in the paper. Figure 2 provides an 'Overview of Judge Bench Pipeline' as a diagram, not a pseudocode block, and Appendix A.6 provides prompts, not structured algorithms. |
| Open Source Code | Yes | Data and code are available at https://github.com/ScalerLab/JudgeBench. |
| Open Datasets | Yes | MMLU-Pro (Wang et al., 2024a). We use MMLU-Pro for the Knowledge category. MMLU-Pro is a challenging multi-task dataset, filtered from the original MMLU dataset (Hendrycks et al., 2020). [...] LiveBench (White et al., 2024). LiveBench offers datasets in categories such as reasoning, mathematics, and instruction-following, and releases new data monthly to avoid contamination. [...] LiveCodeBench (Jain et al., 2024). LiveCodeBench is a contamination-free dataset for coding tasks, containing over 300 challenging questions sourced from coding contests like LeetCode, AtCoder, and Codeforces. We select this dataset for the Coding category. |
| Dataset Splits | No | The paper describes the composition of the JudgeBench dataset used for evaluation, stating 'our dataset consists of a total of 350 questions: 154 in Knowledge, 98 in Reasoning, 56 in Mathematics, and 42 in Coding.' However, it does not explicitly define training, validation, and test splits for the experimental setup or for the models being evaluated, beyond implicitly using the entire JudgeBench dataset as a test set. |
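The reported composition is internally consistent: the four category counts do sum to the stated total, which is worth checking when treating the whole dataset as a single test set. A minimal sanity check:

```python
# Category counts as reported in the JudgeBench paper.
category_counts = {
    "Knowledge": 154,
    "Reasoning": 98,
    "Mathematics": 56,
    "Coding": 42,
}

# The four categories should account for all 350 questions.
total = sum(category_counts.values())
assert total == 350, f"expected 350 questions, got {total}"
print(total)
```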
| Hardware Specification | No | The paper states that 'All open-weight LLMs (including reward models) were served locally in half-precision, except for Llama-3.1-405B-Instruct for which we utilized the Together API.' However, it does not provide specific details on the hardware models (e.g., GPU/CPU types) used for these local deployments or for the experiments. |
| Software Dependencies | No | The paper mentions serving open-weight LLMs locally and using the Together API, but it does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | Yes | We closely followed the official implementation of each judge, only making modifications where necessary. One broad change we made across all judges is the use of greedy decoding (temperature=0) to ensure reproducibility. Any additional judge-specific modifications are detailed below. [...] Each judgment must be no more than 1024 tokens [...] Each judgment must be less than 4096 tokens; however, if we were unable to extract the verdict using regex (e.g., if the judgment was incomplete after 4096 tokens), the judge was given one more opportunity to continue its judgment (up to 4096 additional tokens) and output a valid verdict. [...] we truncated both candidate responses (from the left) to fit the request in the limited context window of 2048 tokens. |
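The judge-evaluation loop described above (greedy decoding for reproducibility, a token cap per judgment, regex-based verdict extraction, and a single continuation attempt when no verdict can be parsed) can be sketched as follows. This is a hypothetical reconstruction, not the paper's implementation: `generate`, the `Verdict:` format, and the regex pattern are all assumptions standing in for whatever judge-specific serving API and output format each judge actually uses.

```python
import re

MAX_TOKENS = 4096  # per-judgment token cap described in the setup


def extract_verdict(judgment: str):
    """Pull a final A/B verdict out of a free-form judgment.

    The 'Verdict: A' / 'Verdict: [B]' format and this pattern are
    illustrative; the paper's extraction is judge-specific.
    """
    match = re.search(r"\bverdict:\s*\[?([AB])\]?", judgment, re.IGNORECASE)
    return match.group(1).upper() if match else None


def judge_pair(generate, prompt: str):
    """Run one pairwise judgment with greedy decoding and one retry.

    `generate(prompt, temperature, max_tokens)` is a hypothetical
    stand-in for the model-serving call. Per the setup: temperature=0
    for reproducibility; if no verdict is parseable (e.g., the judgment
    was cut off at MAX_TOKENS), the judge gets one continuation of up
    to MAX_TOKENS additional tokens before giving up.
    """
    judgment = generate(prompt, temperature=0, max_tokens=MAX_TOKENS)
    verdict = extract_verdict(judgment)
    if verdict is None:
        # One more opportunity: continue from the incomplete judgment.
        judgment += generate(prompt + judgment, temperature=0,
                             max_tokens=MAX_TOKENS)
        verdict = extract_verdict(judgment)
    return verdict
```

With a mock `generate` that first returns an incomplete judgment and then a continuation ending in `Verdict: A`, `judge_pair` makes exactly two calls and returns `"A"`; a judgment that never yields a parseable verdict returns `None`.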