AI Agents That Matter

Authors: Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan

TMLR 2025

Reproducibility checklist: each entry below gives the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental
AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, and sometimes none at all. This has led to agents that are fragile because they take shortcuts and overfit to the benchmark in various ways. We prescribe a principled framework for avoiding overfitting. Finally, there is a lack of standardization in evaluation practices, leading to a pervasive lack of reproducibility. We hope that the steps we introduce for addressing these shortcomings will spur the development of agents that are useful in the real world and not just accurate on benchmarks. This paper empirically demonstrates these challenges and provides recommendations for addressing them.
Researcher Affiliation: Academia
- Sayash Kapoor (EMAIL), Department of Computer Science and Center for Information Technology Policy, Princeton University
- Benedikt Stroebl (EMAIL), Center for Information Technology Policy, Princeton University
- Zachary S. Siegel, Department of Computer Science and Center for Information Technology Policy, Princeton University
- Nitya Nadgir, Center for Information Technology Policy, Princeton University
- Arvind Narayanan (EMAIL), Department of Computer Science and Center for Information Technology Policy, Princeton University
Pseudocode: No
The paper describes various agent designs and optimization processes (e.g., modifying the DSPy framework, using Optuna for parameter search) and provides a prompt example (Listing 1), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes
We release code to reproduce all experimental results of this paper in a GitHub repository under an MIT license. This includes scripts to reproduce our analyses of HumanEval, HotPotQA, NovelQA, and WebArena, as well as implementations of our proposed baselines and the DSPy implementation for our joint optimizer. As an example of an interface that lets downstream users explore the impact of varying API costs, we also provide an interactive web application. This allows users to input current pricing for different language models and visualize the adjusted cost-accuracy tradeoffs on HumanEval for the agents we evaluated (Section 2). Finally, we plan to release our joint optimizer to the official DSPy repository and the research community. To show how we compute Pareto frontiers, we provide a simple example implementation on simulated agent evaluation data.
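As a rough analogue of that example, the cost-accuracy Pareto frontier over simulated agent results can be computed as below. The agent names and numbers are invented for illustration and are not taken from the paper's data.

```python
def pareto_frontier(agents):
    """Return the agents not dominated by any other agent.

    An agent is dominated if some other agent has accuracy >= its accuracy
    AND cost <= its cost, with at least one of the two strictly better.
    Each entry is a (name, accuracy, cost) tuple.
    """
    frontier = []
    for name, acc, cost in agents:
        dominated = any(
            a2 >= acc and c2 <= cost and (a2 > acc or c2 < cost)
            for _, a2, c2 in agents
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return frontier

# Simulated (name, accuracy, cost-in-USD) evaluation data -- illustrative only.
agents = [
    ("retry-baseline", 0.71, 0.5),
    ("warming-baseline", 0.74, 0.6),
    ("complex-agent", 0.75, 4.0),
    ("escalation", 0.74, 1.2),
]

# Keeps retry-baseline, warming-baseline, and complex-agent;
# drops escalation, which is dominated by warming-baseline.
print(pareto_frontier(agents))
```

Note that an agent cannot dominate itself under this definition, since the strict-improvement condition fails when comparing an agent against its own point.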
Open Datasets: Yes
Specifically, we included agents from the HumanEval leaderboard on Papers With Code that share their code publicly (Chen et al., 2021): LDB (Zhong et al., 2024), LATS (Zhou et al., 2023), and Reflexion (Shinn et al., 2023). As an illustration of the potential of joint optimization, we modify the DSPy framework (Khattab et al., 2023) and evaluate it on the HotPotQA benchmark (Yang et al., 2018). Through a case study of NovelQA (Wang et al., 2024a), we show how benchmarks meant for model evaluation can be misleading when used for downstream evaluation. WebArena is an agent benchmark that aims to evaluate agents on tasks on the web (Zhou et al., 2024).
Dataset Splits: Yes
We use the modified benchmark version of HumanEval provided with the LDB paper (Zhong et al., 2024) since it includes example test cases for all 164 tasks (in the original benchmark, example test cases are provided for only 161 of 164 tasks, as detailed in Section 6). We use 100 samples from the HotPotQA training set to optimize the DSPy pipelines and 200 samples from the evaluation set to evaluate the results (this is consistent with the implementation of the DSPy pipelines provided by the developers to illustrate efficacy at multi-hop retrieval).
Hardware Specification: No
For all our experiments using OpenAI models, we utilized the endpoints provided by OpenAI, either directly or through the Azure OpenAI Service. For the analysis on HotPotQA using Llama-3 models, we relied on the endpoints provided by Together.ai. As our work primarily relied on external APIs, we did not use any GPUs for inference and our experiments did not require training of LLMs.
Software Dependencies: No
The paper mentions using the DSPy framework (Khattab et al., 2023) and the Optuna hyperparameter optimization framework (Akiba et al., 2019), as well as specific language models accessed via API (e.g., gpt-3.5-turbo-0125, gpt-4-turbo-2024-04-09, Llama-3-70B) and the ColBERTv2 retriever model. However, specific version numbers for the DSPy or Optuna frameworks, or for the ColBERTv2 model, are not provided.
Experiment Setup: Yes
We kept all parameters as specified in the code accompanying the original paper. In particular, this means that the maximum number of iterations is set to 10 and the temperature to zero.

Based on correspondence with the authors, we set the maximum number of iterations to 8, the expansion factor to 3, and the temperature for generating the function implementations to 0.8. The temperature for generating self-reflections and the internal unit tests was set to 0.2. The maximum number of internal test cases was set to 6 for runs with GPT-3.5 and 4 for runs using GPT-4.

We left all parameters unchanged from the ones provided in the original repository, setting the maximum number of iterations to 2, the expansion factor to 3, and the temperature to zero for generating function implementations. The temperature used for generating the internal tests and self-reflections is 0.2.

For the warming baseline, we modify the retry baseline by gradually increasing the temperature parameter across successive trials. Initially, the temperature was set at zero, mirroring the retry baseline. For the second and third trials, we raised the temperature to 0.3, and for the final two trials, we increased it further to 0.5.

In our objective function required by Optuna, we sample values to search over the following parameters to find Pareto-optimal agent designs: (a) the temperature for each module within the agent, (b) the number of few-shot examples, (c) the selection of specific examples to include, and (d) whether to add formatting instructions. Candidate temperature values for each module in the agent pipeline are sampled from 0.0, 0.2, 0.4, and 0.6. We set the number of trials Optuna conducts to 16. The maximum number of few-shot examples per prompt is set to 8.