Agent-as-a-Judge: Evaluate Agents with Agents

Authors: Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, like a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline.
Researcher Affiliation | Collaboration | Meta AI and KAUST.
Pseudocode | No | The paper includes pipeline diagrams (Figure 8) and descriptions of modular components, but it does not contain explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | To help that, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge
Open Datasets | Yes | To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks... To help that, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge
Dataset Splits | No | The paper introduces the DevAI dataset as a benchmark comprising 55 tasks with requirements. It does not describe any train/test/validation splits for this benchmark itself, as it is used for evaluating agents on complete tasks rather than for training models on partitioned data.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run its own experiments or the Agent-as-a-Judge framework. It mentions general constraints for the evaluated AI developers such as 'the hardware you are running on is unknown, and the presence of a GPU is not guaranteed.'
Software Dependencies | Yes | All of these three systems require a language model as a back-end engine, for which we use gpt-4o-2024-05-13, a state-of-the-art language model... [Python Interpreter: /openhands/poetry/openhands-5O4_aCHf-py3.11/bin/python]
Experiment Setup | Yes | All of these three systems require a language model as a back-end engine, for which we use gpt-4o-2024-05-13, a state-of-the-art language model. These AI developers were given a time-limit of 1800 seconds to solve each task and were forcefully halted if they exceeded this time limit...