ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

Authors: Haiyang SHEN, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we introduce SHORTCUTSBENCH, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks. ... Moreover, our extensive evaluation of agents built with 5 leading open-source (size ≥ 57B) and 5 closed-source LLMs (e.g., Gemini-1.5-Pro and GPT-4o-mini) with varying intelligence levels reveals significant limitations of existing API-based agents in the whole process of handling complex queries related to API selection, parameter filling, and requesting necessary input from the system and the user.
Researcher Affiliation Academia 1 Institute for Artificial Intelligence, Peking University; 2 School of Computer Science, Peking University; 3 School of Software & Microelectronics, Peking University; 4 School of Electronics Engineering and Computer Science, Peking University; 5 Beijing University of Posts and Telecommunications
Pseudocode No The paper describes methods through structured prompt templates in Figures 9, 10, 11, and 13, but does not contain sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format for an algorithm.
Open Source Code Yes All datasets, code, experimental logs, and results are available at https://github.com/EachSheep/ShortcutsBench.
Open Datasets Yes All datasets, code, experimental logs, and results are available at https://github.com/EachSheep/ShortcutsBench. In this paper, we introduce SHORTCUTSBENCH, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks.
Dataset Splits Yes As shown in Table 3, we categorize SHORTCUTSBENCH into 4 difficulty levels and 8 task types based on |aseq_i| and shortcut type, respectively. For more details, please refer to the Appendix A.3. ... The number of shortcuts in each level is denoted as n_p. Each query and action sequence is referred to as q_{p,i} and aseq_{p,i}, with 1 ≤ p ≤ 4 and 1 ≤ i ≤ n_p.
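The level-based split quoted above (difficulty assigned by action-sequence length |aseq_i|) could be sketched as follows. This is a minimal illustration only: the threshold values and shortcut names are hypothetical placeholders, not the paper's actual level boundaries or data.

```python
# Hypothetical sketch: bucket shortcuts into 4 difficulty levels by the
# length of their action sequence |aseq|. The bounds are illustrative
# placeholders, not the paper's actual level boundaries.
from typing import List


def difficulty_level(aseq: List[str], bounds=(5, 10, 20)) -> int:
    """Return a level in 1..4; longer action sequences map to higher levels."""
    n = len(aseq)
    for level, upper in enumerate(bounds, start=1):
        if n <= upper:
            return level
    return 4


# Toy shortcuts (names and actions invented for illustration).
shortcuts = {
    "set_alarm": ["alarm.create"],
    "morning_briefing": ["weather.get"] * 7,
    "photo_pipeline": ["photos.select"] * 15,
    "home_automation": ["hub.call"] * 30,
}
levels = {name: difficulty_level(aseq) for name, aseq in shortcuts.items()}
print(levels)
# -> {'set_alarm': 1, 'morning_briefing': 2, 'photo_pipeline': 3, 'home_automation': 4}
```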
Hardware Specification No The paper discusses the cost of running experiments on various LLMs and their pricing per token (Table 6), but it does not specify the underlying hardware (e.g., GPU/CPU models, memory) used for these evaluations.
Software Dependencies No Referencing existing work (Huang et al., 2024b; Qin et al., 2024; Li et al., 2023), and considering the performance of existing LLMs, we selected the 10 most advanced LLMs to construct API-based agents. The chosen models include 5 closed-source and 5 open-source LLMs, covering varying intelligence levels. No specific versions of these LLMs or other software libraries are mentioned.
Experiment Setup Yes Prompt Template. Following existing work (Huang et al., 2024b; Qin et al., 2024; Tang et al., 2023; Zhuang et al., 2024), we slightly modified the ReAct (Yao et al., 2023) templates to construct the API-based agents. For all 3 research questions (RQs), we use the same prompt templates. An agent should correctly select APIs, fill in parameters, and be aware of the need to request necessary input from the system or user at appropriate times. Please refer to Appendix A.7 for more details.
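The ReAct-style setup described above (an agent that selects APIs, fills parameters, and requests missing input) can be sketched as a prompt-assembly function. This is a generic illustration of the Thought/Action/Observation pattern, not the paper's actual template from Appendix A.7; the template wording, API names, and docstrings are assumptions.

```python
# Illustrative ReAct-style prompt builder for an API-based agent.
# The instruction text and example APIs below are hypothetical, not the
# paper's actual prompt template.

def build_react_prompt(query: str, api_docs: dict, history: list) -> str:
    """Assemble a Thought -> Action -> Observation prompt for the next step."""
    apis = "\n".join(f"- {name}: {doc}" for name, doc in api_docs.items())
    steps = "\n".join(
        f"Thought: {t}\nAction: {a}\nObservation: {o}" for t, a, o in history
    )
    return (
        "Solve the user's task by selecting APIs, filling their parameters,\n"
        "and asking the system or user for any missing input.\n"
        f"Available APIs:\n{apis}\n\n"
        f"Task: {query}\n"
        f"{steps}\n"
        "Thought:"
    )


prompt = build_react_prompt(
    query="Resize my latest photo to 800px wide",
    api_docs={
        "photos.latest": "returns the most recent photo",
        "image.resize": "params: image, width",
    },
    history=[("I need the latest photo first.", "photos.latest()", "photo_123")],
)
print(prompt)
```

The trailing "Thought:" leaves the completion point for the LLM, whose next Action line would then be parsed into an API call.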