Automated Design of Agentic Systems
Authors: Shengran Hu, Cong Lu, Jeff Clune
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across multiple domains including coding, science, and math, we show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents. |
| Researcher Affiliation | Academia | Shengran Hu1,2, Cong Lu1,2, Jeff Clune1,2,3 1University of British Columbia, 2Vector Institute, 3Canada CIFAR AI Chair EMAIL, EMAIL |
| Pseudocode | Yes | A pseudocode of the algorithm is provided in Appendix I. |
| Open Source Code | Yes | All code is open-sourced at https://github.com/ShengranHu/ADAS. |
| Open Datasets | Yes | We evaluate the proposed Meta Agent Search on: (1) the challenging ARC logic puzzle task (Chollet, 2019) that aims to test the general intelligence of an AI system, (2) four popular benchmarks on reading comprehension, math, science questions, and multi-task problem solving, and (3) the transferability of discovered agents to held-out domains and models (Section 4). We test Meta Agent Search on four popular benchmarks: (1) DROP (Dua et al., 2019) for evaluating Reading Comprehension; (2) MGSM (Shi et al., 2023) for evaluating Math capability under a multi-lingual setting; (3) MMLU (Hendrycks et al., 2021) for evaluating Multi-task Problem Solving; and (4) GPQA (Rein et al., 2023) for evaluating the capability of solving hard (graduate-level) questions in Science. |
| Dataset Splits | Yes | We sample a validation set and a test set with 20 and 60 questions, respectively, for searching and testing. ... For GPQA (Science), we use GPQA diamond and the validation set consists of 32 questions, while the remaining 166 questions form the test set. For the other domains, the validation and test sets are sampled with 128 and 800 questions, respectively. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, but rather mentions using specific large language models (e.g., GPT-4, GPT-3.5, Claude-Haiku, Claude-Sonnet) via API calls. For example: "Meta Agent Search runs for 25 iterations and the meta agent uses GPT-4 (OpenAI, 2024), while discovered agents and baselines are evaluated using GPT-3.5 (OpenAI, 2022) to reduce compute cost." and "A single run of search and evaluation on ARC (Section 4.1) costs approximately $500 USD in OpenAI API costs". |
| Software Dependencies | No | The paper mentions using Python for implementation and various Foundation Models (GPT, Claude) which are accessed via API, but it does not specify version numbers for Python itself or any other software libraries or dependencies. For example: "Given that most programming languages, such as Python, which we use in this paper, are Turing Complete..." |
| Experiment Setup | Yes | Code 3, which details the best agent on ARC, explicitly specifies hyperparameters such as `num_candidates = 5`, `max_refinement_iterations = 3`, and various `temperature` settings for different FM_Module instances (e.g., `temperature=0.8`, `temperature=0.5`, `temperature=0.6`, `temperature=0.1`). |
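The dataset split sizes quoted in the table can be summarized in a short sketch. Only the per-benchmark sizes come from the paper's text; the `sample_splits` helper and the use of a seeded random shuffle are assumptions for illustration, not the authors' released code.

```python
import random

# Validation/test split sizes as quoted from the paper (Section 4).
SPLIT_SIZES = {
    "ARC":  {"valid": 20,  "test": 60},
    "GPQA": {"valid": 32,  "test": 166},   # GPQA diamond: 32 validation + remaining 166 test
    "DROP": {"valid": 128, "test": 800},
    "MGSM": {"valid": 128, "test": 800},
    "MMLU": {"valid": 128, "test": 800},
}

def sample_splits(questions, n_valid, n_test, seed=0):
    """Randomly partition `questions` into disjoint validation and test sets.

    Hypothetical helper: the paper says splits are "sampled" but does not
    publish the sampling procedure or seed.
    """
    rng = random.Random(seed)
    pool = list(questions)
    rng.shuffle(pool)
    return pool[:n_valid], pool[n_valid:n_valid + n_test]

# Example: a DROP-sized split drawn from 1000 dummy question IDs.
valid, test = sample_splits(range(1000), **{
    "n_valid": SPLIT_SIZES["DROP"]["valid"],
    "n_test": SPLIT_SIZES["DROP"]["test"],
})
```

The validation set is what Meta Agent Search optimizes against during search, while the held-out test set is used only for final evaluation, so keeping the two disjoint is the essential property.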