Multi-agent Architecture Search via Agentic Supernet

Authors: Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang

ICML 2025

Reproducibility: Variable — Result — LLM Response
Research Type: Experimental — Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6~45% of the inference costs of existing handcrafted or automated multi-agent systems, (II) surpasses them by 0.54%~16.89%, and (III) enjoys superior cross-dataset and cross-LLM-backbone transferability. The code is available at https://github.com/bingreeky/MaAS. (...) We conduct comprehensive evaluations on seven widely adopted benchmarks, covering diverse use cases in code generation (HumanEval, MBPP), mathematical reasoning (GSM8K, MATH, SVAMP), and diverse tool usage (GAIA).
Researcher Affiliation: Collaboration — (1) National University of Singapore, (2) Tongji University, (3) Nanyang Technological University, (4) Shanghai AI Laboratory, (5) University of Science and Technology of China.
Pseudocode: Yes — Algorithm 1: Algorithm workflow of MaAS
Open Source Code: Yes — The code is available at https://github.com/bingreeky/MaAS.
Open Datasets: Yes — We evaluate MaAS on six public benchmarks covering three domains: (1) math reasoning: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and MultiArith (Roy & Roth, 2016); (2) code generation: HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021); and (3) tool use: GAIA (Mialon et al., 2023).
Dataset Splits: Yes — Building upon established methodologies in workflow automation (Saad-Falcon et al., 2024; Hu et al., 2024a; Zhang et al., 2024c), we divide each dataset into training and test sets using a TRAIN:TEST ratio of 1:4. For the MATH benchmark, we adhere to (Hong et al., 2024), selecting a subset of 617 harder problems spanning four representative categories (Combinatorics & Probability, Number Theory, Pre-algebra, and Pre-calculus), all at difficulty level 5. The dataset statistics are included in Table 6.
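The reported 1:4 TRAIN:TEST ratio can be sketched as a simple shuffled partition. This is an illustrative sketch only: the paper does not specify the seed, shuffling, or any stratification, so `split_dataset` and its parameters are assumptions, not the authors' code.

```python
import random

def split_dataset(items, train_ratio=0.2, seed=0):
    """Partition items into train/test with TRAIN:TEST = 1:4 by default.

    A train_ratio of 0.2 yields one training example for every four
    test examples, matching the ratio reported in the paper.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

train, test = split_dataset(range(100))
# With 100 items: 20 train, 80 test (a 1:4 ratio)
```

Any stratification by problem category (as might matter for the MATH level-5 subset) would need to be layered on top of this basic split.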
Hardware Specification: No — The paper mentions leveraging LLM APIs (gpt-4o-mini-0718, Qwen-2.5-72b-instruct, llama-3.1-70b) and discusses training/inference costs and wall-clock time, but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for conducting its own experiments or model training.
Software Dependencies: Yes — We leverage both a closed-source LLM (gpt-4o-mini-0718 (OpenAI, 2024)) and open-source LLMs (Qwen-2.5-72b-instruct (Yang et al., 2024) and llama-3.1-70b (Dubey et al., 2024)). All models are accessed via APIs with the temperature set to 1. (...) The prompt for generating the operator profile is as follows: (...) Python (...) lightweight text embedding model (in our case, MiniLM (Wang et al., 2020)) and SentenceBERT (Reimers, 2019)
Experiment Setup: Yes — We set the number of layers as L = 4, the cost penalty coefficient λ ∈ {1e-3, 5e-3, 1e-2}, the sampling times K = 4, and thres = 0.3 for Equation (9).
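The reported hyperparameters can be collected into a single configuration object, which is how one might wire them into a reproduction. The class and field names below are hypothetical (the official repository may organize these differently); only the values come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MaASConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    num_layers: int = 4                              # supernet depth, L = 4
    cost_lambda_grid: tuple = (1e-3, 5e-3, 1e-2)     # cost penalty coefficient λ sweep
    sample_times: int = 4                            # sampling times, K = 4
    threshold: float = 0.3                           # thres for Equation (9)

cfg = MaASConfig()
```

Freezing the dataclass keeps the configuration immutable across a run, so a λ sweep would instantiate one config per value in `cost_lambda_grid` rather than mutating a shared object.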