Agent-as-a-Judge: Evaluate Agents with Agents
Authors: Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, like a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. |
| Researcher Affiliation | Collaboration | 1Meta AI 2KAUST. |
| Pseudocode | No | The paper includes pipeline diagrams (Figure 8) and descriptions of modular components, but it does not contain explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | To help that, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge |
| Open Datasets | Yes | To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks... To help that, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge |
| Dataset Splits | No | The paper introduces the DevAI dataset as a benchmark comprising 55 tasks with requirements. It does not describe any train/test/validation splits, as the benchmark is used to evaluate agents on complete tasks rather than to train models on partitioned data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run its own experiments or the Agent-as-a-Judge framework. It mentions general constraints for the evaluated AI developers such as 'the hardware you are running on is unknown, and the presence of a GPU is not guaranteed.' |
| Software Dependencies | Yes | All of these three systems require a language model as a back-end engine, for which we use gpt-4o-2024-05-13, a state-of-the-art language model... [Python Interpreter: /openhands/poetry/openhands-5O4_aCHf-py3.11/bin/python] |
| Experiment Setup | Yes | All of these three systems require a language model as a back-end engine, for which we use gpt-4o-2024-05-13, a state-of-the-art language model. These AI developers were given a time-limit of 1800 seconds to solve each task and were forcefully halted if they exceeded this time limit... |
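The Research Type row notes that DevAI pairs each task with hierarchical solution requirements, which an agentic evaluator checks in dependency order. A minimal sketch of that idea, with `judge_requirements` and `judge_fn` as hypothetical names (the paper's actual implementation uses an LLM-backed judge over the generated workspace):

```python
# Hypothetical sketch of an Agent-as-a-Judge style check: walk a task's
# hierarchical requirements and ask a judge function whether each one is
# satisfied. judge_fn stands in for an LLM-backed evaluator; all names
# here are illustrative, not taken from the paper's codebase.

def judge_requirements(requirements, judge_fn):
    """Return a {requirement_id: bool} verdict map, respecting dependencies."""
    verdicts = {}
    for req in requirements:
        # A requirement is only evaluated once all its prerequisites pass;
        # otherwise it is marked unsatisfied without consulting the judge.
        if all(verdicts.get(dep, False) for dep in req.get("deps", [])):
            verdicts[req["id"]] = judge_fn(req["criteria"])
        else:
            verdicts[req["id"]] = False
    return verdicts

# Usage with a trivial keyword-based judge standing in for the LLM call:
reqs = [
    {"id": "R1", "criteria": "dataset is loaded", "deps": []},
    {"id": "R2", "criteria": "model is trained", "deps": ["R1"]},
]
workspace_log = "dataset is loaded; model is trained"
verdicts = judge_requirements(reqs, lambda c: c in workspace_log)
```

Evaluating requirements in dependency order mirrors why hierarchical annotations suit an agentic evaluator: a downstream requirement need not be judged if its prerequisites already failed.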
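The Experiment Setup row states that each AI developer had 1800 seconds per task and was forcefully halted past that budget. One simple way to reproduce such a harness, assuming the developer runs as a subprocess (the command below is a placeholder, not the paper's actual entry point):

```python
# Minimal sketch of a per-task time limit: run an AI developer as a
# subprocess and forcefully halt it when the budget is exceeded.
import subprocess
import sys

TIME_LIMIT_S = 1800  # per-task budget reported in the paper

def run_with_limit(cmd, limit=TIME_LIMIT_S):
    """Run cmd; return (returncode, halted) where halted marks a timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=limit)
        return proc.returncode, False  # finished within budget
    except subprocess.TimeoutExpired:
        return None, True  # forcefully halted by the harness

# Placeholder task that finishes well inside a small test budget:
rc, halted = run_with_limit([sys.executable, "-c", "print('done')"], limit=30)
```

`subprocess.run` with `timeout=` kills the child and raises `TimeoutExpired` when the limit passes, which matches the "forcefully halted" behavior described in the setup.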