ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

Authors: Haiyang SHEN, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we introduce SHORTCUTSBENCH, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks. ... Moreover, our extensive evaluation of agents built with 5 leading open-source (size ≥ 57B) and 5 closed-source LLMs (e.g., Gemini-1.5-Pro and GPT-4o-mini) with varying intelligence levels reveals significant limitations of existing API-based agents in the whole process of handling complex queries related to API selection, parameter filling, and requesting necessary input from the system and the user.
Researcher Affiliation Academia 1 Institute for Artificial Intelligence, Peking University; 2 School of Computer Science, Peking University; 3 School of Software & Microelectronics, Peking University; 4 School of Electronics Engineering and Computer Science, Peking University; 5 Beijing University of Posts and Telecommunications
Pseudocode No The paper describes methods through structured prompt templates in Figures 9, 10, 11, and 13, but does not contain sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format for an algorithm.
Open Source Code Yes All datasets, code, experimental logs, and results are available at https://github.com/EachSheep/ShortcutsBench.
Open Datasets Yes All datasets, code, experimental logs, and results are available at https://github.com/EachSheep/ShortcutsBench. In this paper, we introduce SHORTCUTSBENCH, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks.
Dataset Splits Yes As shown in Table 3, we categorize SHORTCUTSBENCH into 4 difficulty levels and 8 task types based on |aseq_i| and shortcut type, respectively. For more details, please refer to the Appendix A.3. ... The number of shortcuts in each level is denoted as n_p. Each query and action sequence is referred to as q_{p,i} and aseq_{p,i}, with 1 ≤ p ≤ 4 and 1 ≤ i ≤ n_p.
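The level-based split quoted above (difficulty assigned by action-sequence length |aseq_i|) could be sketched as follows. This is a minimal illustration only: the threshold values and shortcut names are hypothetical placeholders, not the paper's actual level boundaries or data.

```python
# Hypothetical sketch: bucket shortcuts into 4 difficulty levels by the
# length of their action sequence |aseq|. The bounds are illustrative
# placeholders, not the paper's actual level boundaries.
from typing import List


def difficulty_level(aseq: List[str], bounds=(5, 10, 20)) -> int:
    """Return a level in 1..4; longer action sequences map to higher levels."""
    n = len(aseq)
    for level, upper in enumerate(bounds, start=1):
        if n <= upper:
            return level
    return 4


# Toy shortcuts (names and actions invented for illustration).
shortcuts = {
    "set_alarm": ["alarm.create"],
    "morning_briefing": ["weather.get"] * 7,
    "photo_pipeline": ["photos.select"] * 15,
    "home_automation": ["hub.call"] * 30,
}
levels = {name: difficulty_level(aseq) for name, aseq in shortcuts.items()}
print(levels)
# -> {'set_alarm': 1, 'morning_briefing': 2, 'photo_pipeline': 3, 'home_automation': 4}
```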
Hardware Specification No The paper discusses the cost of running experiments on various LLMs and their pricing per token (Table 6), but it does not specify the underlying hardware (e.g., GPU/CPU models, memory) used for these evaluations.
Software Dependencies No Referencing existing work (Huang et al., 2024b; Qin et al., 2024; Li et al., 2023), and considering the performance of existing LLMs, we selected the 10 most advanced LLMs to construct API-based agents. The chosen models include 5 closed-source and 5 open-source LLMs, covering varying intelligence levels. No specific versions of these LLMs or other software libraries are mentioned.
Experiment Setup Yes Prompt Template. Following existing work (Huang et al., 2024b; Qin et al., 2024; Tang et al., 2023; Zhuang et al., 2024), we slightly modified the ReAct (Yao et al., 2023) templates to construct the API-based agents. For all 3 research questions (RQs), we use the same prompt templates. An agent should correctly select APIs, fill in parameters, and be aware of the need to request necessary input from the system or user at appropriate times. Please refer to Appendix A.7 for more details.
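The ReAct-style setup described above (an agent that selects APIs, fills parameters, and requests missing input) can be sketched as a prompt-assembly function. This is a generic illustration of the Thought/Action/Observation pattern, not the paper's actual template from Appendix A.7; the template wording, API names, and docstrings are assumptions.

```python
# Illustrative ReAct-style prompt builder for an API-based agent.
# The instruction text and example APIs below are hypothetical, not the
# paper's actual prompt template.

def build_react_prompt(query: str, api_docs: dict, history: list) -> str:
    """Assemble a Thought -> Action -> Observation prompt for the next step."""
    apis = "\n".join(f"- {name}: {doc}" for name, doc in api_docs.items())
    steps = "\n".join(
        f"Thought: {t}\nAction: {a}\nObservation: {o}" for t, a, o in history
    )
    return (
        "Solve the user's task by selecting APIs, filling their parameters,\n"
        "and asking the system or user for any missing input.\n"
        f"Available APIs:\n{apis}\n\n"
        f"Task: {query}\n"
        f"{steps}\n"
        "Thought:"
    )


prompt = build_react_prompt(
    query="Resize my latest photo to 800px wide",
    api_docs={
        "photos.latest": "returns the most recent photo",
        "image.resize": "params: image, width",
    },
    history=[("I need the latest photo first.", "photos.latest()", "photo_123")],
)
print(prompt)
```

The trailing "Thought:" leaves the completion point for the LLM, whose next Action line would then be parsed into an API call.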