Automated Design of Agentic Systems
Authors: Shengran Hu, Cong Lu, Jeff Clune
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across multiple domains including coding, science, and math, we show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents. |
| Researcher Affiliation | Academia | Shengran Hu1,2, Cong Lu1,2, Jeff Clune1,2,3 1University of British Columbia, 2Vector Institute, 3Canada CIFAR AI Chair EMAIL, EMAIL |
| Pseudocode | Yes | A pseudocode of the algorithm is provided in Appendix I. |
| Open Source Code | Yes | All code is open-sourced at https://github.com/ShengranHu/ADAS. |
| Open Datasets | Yes | We evaluate the proposed Meta Agent Search on: (1) the challenging ARC logic puzzle task (Chollet, 2019) that aims to test the general intelligence of an AI system, (2) four popular benchmarks on reading comprehension, math, science questions, and multi-task problem solving, and (3) the transferability of discovered agents to held-out domains and models (Section 4). We test Meta Agent Search on four popular benchmarks: (1) DROP (Dua et al., 2019) for evaluating Reading Comprehension; (2) MGSM (Shi et al., 2023) for evaluating Math capability under a multi-lingual setting; (3) MMLU (Hendrycks et al., 2021) for evaluating Multi-task Problem Solving; and (4) GPQA (Rein et al., 2023) for evaluating the capability of solving hard (graduate-level) questions in Science. |
| Dataset Splits | Yes | We sample a validation set and a test set with 20 and 60 questions, respectively, for searching and testing. ... For GPQA (Science), we use GPQA diamond and the validation set consists of 32 questions, while the remaining 166 questions form the test set. For the other domains, the validation and test sets are sampled with 128 and 800 questions, respectively. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, but rather mentions using specific large language models (e.g., GPT-4, GPT-3.5, Claude-Haiku, Claude-Sonnet) via API calls. For example: "Meta Agent Search runs for 25 iterations and the meta agent uses GPT-4 (OpenAI, 2024), while discovered agents and baselines are evaluated using GPT-3.5 (OpenAI, 2022) to reduce compute cost." and "A single run of search and evaluation on ARC (Section 4.1) costs approximately $500 USD in OpenAI API costs". |
| Software Dependencies | No | The paper mentions using Python for implementation and various Foundation Models (GPT, Claude) which are accessed via API, but it does not specify version numbers for Python itself or any other software libraries or dependencies. For example: "Given that most programming languages, such as Python, which we use in this paper, are Turing Complete..." |
| Experiment Setup | Yes | Code 3, which details the best agent on ARC, explicitly specifies hyperparameters such as `num_candidates = 5`, `max_refinement_iterations = 3`, and various `temperature` settings for different FM_Module instances (e.g., `temperature=0.8`, `temperature=0.5`, `temperature=0.6`, `temperature=0.1`). |
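The dataset split sizes quoted in the table can be summarized in a short sketch. Only the per-benchmark sizes come from the paper's text; the `sample_splits` helper and the use of a seeded random shuffle are assumptions for illustration, not the authors' released code.

```python
import random

# Validation/test split sizes as quoted from the paper (Section 4).
SPLIT_SIZES = {
    "ARC":  {"valid": 20,  "test": 60},
    "GPQA": {"valid": 32,  "test": 166},   # GPQA diamond: 32 validation + remaining 166 test
    "DROP": {"valid": 128, "test": 800},
    "MGSM": {"valid": 128, "test": 800},
    "MMLU": {"valid": 128, "test": 800},
}

def sample_splits(questions, n_valid, n_test, seed=0):
    """Randomly partition `questions` into disjoint validation and test sets.

    Hypothetical helper: the paper says splits are "sampled" but does not
    publish the sampling procedure or seed.
    """
    rng = random.Random(seed)
    pool = list(questions)
    rng.shuffle(pool)
    return pool[:n_valid], pool[n_valid:n_valid + n_test]

# Example: a DROP-sized split drawn from 1000 dummy question IDs.
valid, test = sample_splits(range(1000), **{
    "n_valid": SPLIT_SIZES["DROP"]["valid"],
    "n_test": SPLIT_SIZES["DROP"]["test"],
})
```

The validation set is what Meta Agent Search optimizes against during search, while the held-out test set is used only for final evaluation, so keeping the two disjoint is the essential property.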