Agent-as-a-Judge: Evaluate Agents with Agents

Authors: Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, like a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline.
Researcher Affiliation | Collaboration | Meta AI and KAUST.
Pseudocode | No | The paper includes pipeline diagrams (Figure 8) and descriptions of modular components, but it does not contain explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | To help that, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge
Open Datasets | Yes | To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks... To help that, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge
Dataset Splits | No | The paper introduces the DevAI dataset as a benchmark comprising 55 tasks with requirements. It does not describe any train/test/validation splits for this benchmark itself, as it is used for evaluating agents on complete tasks rather than for training models on partitioned data.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run its own experiments or the Agent-as-a-Judge framework. It mentions general constraints for the evaluated AI developers such as 'the hardware you are running on is unknown, and the presence of a GPU is not guaranteed.'
Software Dependencies | Yes | All of these three systems require a language model as a back-end engine, for which we use gpt-4o-2024-05-13, a state-of-the-art language model... [Python Interpreter: /openhands/poetry/openhands-5O4_aCHf-py3.11/bin/python]
Experiment Setup | Yes | All of these three systems require a language model as a back-end engine, for which we use gpt-4o-2024-05-13, a state-of-the-art language model. These AI developers were given a time-limit of 1800 seconds to solve each task and were forcefully halted if they exceeded this time limit...