AgentSquare: Automatic LLM Agent Search in Modular Design Space
Authors: Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. |
| Researcher Affiliation | Academia | 1Department of Electronic Engineering, Tsinghua University 2Shenzhen International Graduate School, Tsinghua University EMAIL |
| Pseudocode | Yes | The overall framework of AgentSquare is illustrated in Figure 3 and the algorithm is presented in Algorithm 1. ... Algorithm 1: Algorithm of AgentSquare |
| Open Source Code | Yes | Code repo is available at https://github.com/tsinghua-fib-lab/AgentSquare. |
| Open Datasets | Yes | Embodied: ALFWorld (Shridhar et al., 2021) with text-based household tasks where agents navigate and interact with objects using text commands, ScienceWorld (Wang et al., 2022) with interactive science tasks requiring agents to navigate rooms and perform experiments; Game: PDDL (Ma et al., 2024) including many strategic games where agents use PDDL expressions to complete tasks; Web: WebShop (Yao et al., 2022) focusing on online shopping tasks where agents browse and purchase products based on user instructions; Tool: TravelPlanner (Xie et al., 2024) with many travel planning tasks where agents use tools and data to create detailed plans, M3ToolEval (Wang et al., 2024b) including complex tasks requiring multi-turn interactions with multiple tools. |
| Dataset Splits | No | The paper mentions several benchmarks and tasks but does not explicitly provide dataset split details (e.g., train/test/validation percentages or counts). It states, "The specific performance evaluation metric varies in different tasks, following the evaluation settings in their original work," which refers to metrics, not data splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing instance specifications used for running the experiments. It mentions using "GPT-3.5-turbo-0125 and GPT-4o," which are language models accessed via API, implying external computational resources rather than explicitly stated local hardware. |
| Software Dependencies | No | The paper does not list specific versions for ancillary software dependencies (e.g., Python version, or library versions such as PyTorch or TensorFlow). It mentions using "GPT-3.5-turbo-0125 and GPT-4o," but these are models, not a comprehensive list of software dependencies with version numbers. |
| Experiment Setup | Yes | AgentSquare setup. We implement AgentSquare and conduct experiments using both GPT-3.5-turbo-0125 and GPT-4o (Achiam et al., 2023). To ensure a fair comparison, we use the same number of few-shot examples across all methods. The initial agent is set as a random module combination, and the search process terminates after 5 consecutive iterations without performance improvement. |
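The experiment-setup row quotes two concrete search details: the initial agent is a random module combination, and the search stops after 5 consecutive iterations without improvement. That stopping rule can be sketched as a simple patience-based loop. This is a hypothetical illustration, not the paper's Algorithm 1: the module pools (`PLANNING`, `REASONING`) and the `evaluate` stub are invented stand-ins for AgentSquare's modular design space and benchmark scoring.

```python
import random

# Hypothetical module pools standing in for AgentSquare's design space.
PLANNING = ["io", "cot", "tot"]
REASONING = ["direct", "self-consistency", "debate"]

def evaluate(combo):
    """Stand-in for benchmark evaluation; returns a score in [0, 1].
    Deterministic per combination so repeated evaluations agree."""
    rng = random.Random(repr(combo))
    return rng.random()

def search(patience=5, max_iters=100):
    """Patience-based search: start from a random module combination and
    stop after `patience` consecutive iterations with no improvement."""
    best_combo = (random.choice(PLANNING), random.choice(REASONING))
    best_score = evaluate(best_combo)
    stale = 0  # consecutive iterations without improvement
    for _ in range(max_iters):
        combo = (random.choice(PLANNING), random.choice(REASONING))
        score = evaluate(combo)
        if score > best_score:
            best_combo, best_score = combo, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_combo, best_score
```

The patience counter resets on any improvement, so the loop only halts after 5 unproductive proposals in a row, matching the quoted termination criterion; the actual AgentSquare search proposes new combinations with an LLM rather than uniformly at random.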