JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Authors: Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, Ion Stoica
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. |
| Researcher Affiliation | Academia | 1UC Berkeley, 2Washington University in St. Louis |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly presented in the paper. Figure 2 provides an 'Overview of Judge Bench Pipeline' as a diagram, not a pseudocode block, and Appendix A.6 provides prompts, not structured algorithms. |
| Open Source Code | Yes | Data and code are available at https://github.com/ScalerLab/JudgeBench. |
| Open Datasets | Yes | MMLU-Pro (Wang et al., 2024a). We use MMLU-Pro for the Knowledge category. MMLU-Pro is a challenging multi-task dataset, filtered from the original MMLU dataset (Hendrycks et al., 2020). [...] LiveBench (White et al., 2024). LiveBench offers datasets in categories such as reasoning, mathematics, and instruction-following, and releases new data monthly to avoid contamination. [...] LiveCodeBench (Jain et al., 2024). LiveCodeBench is a contamination-free dataset for coding tasks, containing over 300 challenging questions sourced from coding contests like LeetCode, AtCoder, and Codeforces. We select this dataset for the Coding category. |
| Dataset Splits | No | The paper describes the composition of the JudgeBench dataset used for evaluation, stating 'our dataset consists of a total of 350 questions: 154 in Knowledge, 98 in Reasoning, 56 in Mathematics, and 42 in Coding.' However, it does not explicitly define training, validation, and test splits for the experimental setup or for the models being evaluated, beyond implicitly using the entire JudgeBench dataset as a test set. |
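The reported composition is internally consistent: the four category counts do sum to the stated total, which is worth checking when treating the whole dataset as a single test set. A minimal sanity check:

```python
# Category counts as reported in the JudgeBench paper.
category_counts = {
    "Knowledge": 154,
    "Reasoning": 98,
    "Mathematics": 56,
    "Coding": 42,
}

# The four categories should account for all 350 questions.
total = sum(category_counts.values())
assert total == 350, f"expected 350 questions, got {total}"
print(total)
```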
| Hardware Specification | No | The paper states that 'All open-weight LLMs (including reward models) were served locally in half-precision, except for Llama-3.1-405B-Instruct for which we utilized the Together API.' However, it does not provide specific details on the hardware models (e.g., GPU/CPU types) used for these local deployments or for the experiments. |
| Software Dependencies | No | The paper mentions serving open-weight LLMs locally and using the Together API, but it does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | Yes | We closely followed the official implementation of each judge, only making modifications where necessary. One broad change we made across all judges is the use of greedy decoding (temperature=0) to ensure reproducibility. Any additional judge-specific modifications are detailed below. [...] Each judgment must be no more than 1024 tokens [...] Each judgment must be less than 4096 tokens; however, if we were unable to extract the verdict using regex (e.g., if the judgment was incomplete after 4096 tokens), the judge was given one more opportunity to continue its judgment (up to 4096 additional tokens) and output a valid verdict. [...] we truncated both candidate responses (from the left) to fit the request in the limited context window of 2048 tokens. |
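The judge-evaluation loop described above (greedy decoding for reproducibility, a token cap per judgment, regex-based verdict extraction, and a single continuation attempt when no verdict can be parsed) can be sketched as follows. This is a hypothetical reconstruction, not the paper's implementation: `generate`, the `Verdict:` format, and the regex pattern are all assumptions standing in for whatever judge-specific serving API and output format each judge actually uses.

```python
import re

MAX_TOKENS = 4096  # per-judgment token cap described in the setup


def extract_verdict(judgment: str):
    """Pull a final A/B verdict out of a free-form judgment.

    The 'Verdict: A' / 'Verdict: [B]' format and this pattern are
    illustrative; the paper's extraction is judge-specific.
    """
    match = re.search(r"\bverdict:\s*\[?([AB])\]?", judgment, re.IGNORECASE)
    return match.group(1).upper() if match else None


def judge_pair(generate, prompt: str):
    """Run one pairwise judgment with greedy decoding and one retry.

    `generate(prompt, temperature, max_tokens)` is a hypothetical
    stand-in for the model-serving call. Per the setup: temperature=0
    for reproducibility; if no verdict is parseable (e.g., the judgment
    was cut off at MAX_TOKENS), the judge gets one continuation of up
    to MAX_TOKENS additional tokens before giving up.
    """
    judgment = generate(prompt, temperature=0, max_tokens=MAX_TOKENS)
    verdict = extract_verdict(judgment)
    if verdict is None:
        # One more opportunity: continue from the incomplete judgment.
        judgment += generate(prompt + judgment, temperature=0,
                             max_tokens=MAX_TOKENS)
        verdict = extract_verdict(judgment)
    return verdict
```

With a mock `generate` that first returns an incomplete judgment and then a continuation ending in `Verdict: A`, `judge_pair` makes exactly two calls and returns `"A"`; a judgment that never yields a parseable verdict returns `None`.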