Multi-agent Architecture Search via Agentic Supernet
Authors: Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6∼45% of the inference costs of existing handcrafted or automated multi-agent systems, (II) surpasses them by 0.54%∼16.89%, and (III) enjoys superior cross-dataset and cross-LLM-backbone transferability. The code is available at https://github.com/bingreeky/MaAS. (...) We conduct comprehensive evaluations on seven widely adopted benchmarks, covering diverse use cases in code generation (HumanEval, MBPP), mathematical reasoning (GSM8K, MATH, SVAMP), and diverse tool usage (GAIA). |
| Researcher Affiliation | Collaboration | ¹National University of Singapore, ²Tongji University, ³Nanyang Technological University, ⁴Shanghai AI Laboratory, ⁵University of Science and Technology of China. |
| Pseudocode | Yes | Algorithm 1 Algorithm workflow of MaAS |
| Open Source Code | Yes | The code is available at https://github.com/bingreeky/MaAS. |
| Open Datasets | Yes | We evaluate MaAS on six public benchmarks covering three domains: (1) math reasoning, GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and MultiArith (Roy & Roth, 2016); (2) code generation, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021); and (3) tool use, GAIA (Mialon et al., 2023). |
| Dataset Splits | Yes | Building upon established methodologies in workflow automation (Saad-Falcon et al., 2024; Hu et al., 2024a; Zhang et al., 2024c), we divide each dataset into training and test sets using a TRAIN:TEST ratio of 1:4. For the MATH benchmark, we adhere to (Hong et al., 2024), selecting a subset of 617 harder problems spanning four representative categories (Combinatorics & Probability, Number Theory, Pre-algebra, and Pre-calculus), all at difficulty level 5. The dataset statistics are included in Table 6. |
| Hardware Specification | No | The paper mentions leveraging LLM APIs (gpt-4o-mini-0718, Qwen-2.5-72b-instruct, llama-3.1-70b) and discusses training/inference costs and wall-clock time, but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for conducting their own experiments or model training. |
| Software Dependencies | Yes | We leverage both a closed-source LLM (gpt-4o-mini-0718 (OpenAI, 2024)) and open-source LLMs (Qwen-2.5-72b-instruct (Yang et al., 2024) and llama-3.1-70b (Dubey et al., 2024)). All models are accessed via APIs with the temperature set to 1. (...) The prompt for generating the operator profile is as follows: (...) Python (...) lightweight text embedding model (in our case, MiniLM (Wang et al., 2020)) and SentenceBERT (Reimers, 2019) |
| Experiment Setup | Yes | We set the number of layers as L = 4, the cost penalty coefficient λ as λ ∈ {1e-3, 5e-3, 1e-2}, and the sampling times K = 4. thres = 0.3 for Equation (9). |
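The quoted setup can be summarized in a short sketch: the 1:4 TRAIN:TEST split described under "Dataset Splits" and the hyperparameters listed under "Experiment Setup". This is a minimal illustration assuming a simple random shuffle; the variable names (`CONFIG`, `train_test_split`) are our own and do not come from the authors' released code.

```python
import random

# Hyperparameters quoted in the paper's experiment setup.
# Names in this dict are illustrative assumptions, not the authors' identifiers.
CONFIG = {
    "num_layers": 4,                           # L = 4
    "lambda_candidates": [1e-3, 5e-3, 1e-2],   # cost penalty coefficient λ
    "sampling_times": 4,                       # K = 4
    "threshold": 0.3,                          # thres for Equation (9)
}

def train_test_split(dataset, train_ratio=1 / 5, seed=0):
    """Split a dataset with a TRAIN:TEST ratio of 1:4 via random shuffling."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    n_train = round(len(dataset) * train_ratio)
    train = [dataset[i] for i in indices[:n_train]]
    test = [dataset[i] for i in indices[n_train:]]
    return train, test

# Example: 100 problems -> 20 train / 80 test, matching the 1:4 ratio.
train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 20 80
```

With a 1:4 ratio the training portion is one fifth of the data, so a 100-item benchmark yields 20 training and 80 test examples.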