AgentSquare: Automatic LLM Agent Search in Modular Design Space
Authors: Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. |
| Researcher Affiliation | Academia | 1Department of Electronic Engineering, Tsinghua University 2Shenzhen International Graduate School, Tsinghua University EMAIL |
| Pseudocode | Yes | The overall framework of AgentSquare is illustrated in Figure 3 and the algorithm is presented in Algorithm 1. ... Algorithm 1: Algorithm of AgentSquare |
| Open Source Code | Yes | Code repo is available at https://github.com/tsinghua-fib-lab/AgentSquare. |
| Open Datasets | Yes | Embodied: ALFWorld (Shridhar et al., 2021) with text-based household tasks where agents navigate and interact with objects using text commands, ScienceWorld (Wang et al., 2022) with interactive science tasks requiring agents to navigate rooms and perform experiments; Game: PDDL (Ma et al., 2024) including many strategic games where agents use PDDL expressions to complete tasks; Web: WebShop (Yao et al., 2022) focusing on online shopping tasks where agents browse and purchase products based on user instructions; Tool: TravelPlanner (Xie et al., 2024) with many travel planning tasks where agents use tools and data to create detailed plans, M3ToolEval (Wang et al., 2024b) including complex tasks requiring multi-turn interactions with multiple tools. |
| Dataset Splits | No | The paper mentions several benchmarks and tasks but does not explicitly provide dataset split details (e.g., train/test/validation percentages or counts). It states, "The specific performance evaluation metric varies in different tasks, following the evaluation settings in their original work," which refers to metrics, not data splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing instance specifications used for running the experiments. It mentions using "GPT-3.5-turbo-0125 and GPT-4o," which are language models accessed via API, implying external computational resources rather than explicitly stated local hardware. |
| Software Dependencies | No | The paper does not list specific versions for ancillary software dependencies (e.g., Python version, or library versions such as PyTorch or TensorFlow). It mentions using "GPT-3.5-turbo-0125 and GPT-4o," but these are models, not a comprehensive list of software dependencies with version numbers. |
| Experiment Setup | Yes | AgentSquare setup. We implement AgentSquare and conduct experiments using both GPT-3.5-turbo-0125 and GPT-4o (Achiam et al., 2023). To ensure a fair comparison, we use the same number of few-shot examples across all methods. The initial agent is set as a random module combination, and the search process terminates after 5 consecutive iterations without performance improvement. |
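The experiment-setup row quotes two concrete search details: the initial agent is a random module combination, and the search stops after 5 consecutive iterations without improvement. That stopping rule can be sketched as a simple patience-based loop. This is a hypothetical illustration, not the paper's Algorithm 1: the module pools (`PLANNING`, `REASONING`) and the `evaluate` stub are invented stand-ins for AgentSquare's modular design space and benchmark scoring.

```python
import random

# Hypothetical module pools standing in for AgentSquare's design space.
PLANNING = ["io", "cot", "tot"]
REASONING = ["direct", "self-consistency", "debate"]

def evaluate(combo):
    """Stand-in for benchmark evaluation; returns a score in [0, 1].
    Deterministic per combination so repeated evaluations agree."""
    rng = random.Random(repr(combo))
    return rng.random()

def search(patience=5, max_iters=100):
    """Patience-based search: start from a random module combination and
    stop after `patience` consecutive iterations with no improvement."""
    best_combo = (random.choice(PLANNING), random.choice(REASONING))
    best_score = evaluate(best_combo)
    stale = 0  # consecutive iterations without improvement
    for _ in range(max_iters):
        combo = (random.choice(PLANNING), random.choice(REASONING))
        score = evaluate(combo)
        if score > best_score:
            best_combo, best_score = combo, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_combo, best_score
```

The patience counter resets on any improvement, so the loop only halts after 5 unproductive proposals in a row, matching the quoted termination criterion; the actual AgentSquare search proposes new combinations with an LLM rather than uniformly at random.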