Multi-agent Architecture Search via Agentic Supernet

Authors: Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang

ICML 2025

Reproducibility: Variable — Result — LLM Response
Research Type: Experimental — Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6~45% of the inference costs of existing handcrafted or automated multi-agent systems, (II) surpasses them by 0.54%~16.89%, and (III) enjoys superior cross-dataset and cross-LLM-backbone transferability. The code is available at https://github.com/bingreeky/MaAS. (...) We conduct comprehensive evaluations on seven widely adopted benchmarks, covering diverse use cases in code generation (HumanEval, MBPP), mathematical reasoning (GSM8K, MATH, SVAMP), and diverse tool usage (GAIA).
Researcher Affiliation: Collaboration — (1) National University of Singapore, (2) Tongji University, (3) Nanyang Technological University, (4) Shanghai AI Laboratory, (5) University of Science and Technology of China.
Pseudocode: Yes — Algorithm 1: Algorithm workflow of MaAS
Open Source Code: Yes — The code is available at https://github.com/bingreeky/MaAS.
Open Datasets: Yes — We evaluate MaAS on six public benchmarks covering three domains: (1) math reasoning: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and MultiArith (Roy & Roth, 2016); (2) code generation: HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021); and (3) tool use: GAIA (Mialon et al., 2023).
Dataset Splits: Yes — Building upon established methodologies in workflow automation (Saad-Falcon et al., 2024; Hu et al., 2024a; Zhang et al., 2024c), we divide each dataset into training and test sets using a TRAIN:TEST ratio of 1:4. For the MATH benchmark, we adhere to (Hong et al., 2024), selecting a subset of 617 harder problems spanning four representative categories (Combinatorics & Probability, Number Theory, Pre-algebra, and Pre-calculus), all at difficulty level 5. The dataset statistics are included in Table 6.
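The reported 1:4 TRAIN:TEST ratio can be sketched as a simple shuffled partition. This is an illustrative sketch only: the paper does not specify the seed, shuffling, or any stratification, so `split_dataset` and its parameters are assumptions, not the authors' code.

```python
import random

def split_dataset(items, train_ratio=0.2, seed=0):
    """Partition items into train/test with TRAIN:TEST = 1:4 by default.

    A train_ratio of 0.2 yields one training example for every four
    test examples, matching the ratio reported in the paper.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

train, test = split_dataset(range(100))
# With 100 items: 20 train, 80 test (a 1:4 ratio)
```

Any stratification by problem category (as might matter for the MATH level-5 subset) would need to be layered on top of this basic split.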
Hardware Specification: No — The paper mentions leveraging LLM APIs (gpt-4o-mini-0718, Qwen-2.5-72b-instruct, llama-3.1-70b) and discusses training/inference costs and wall-clock time, but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for conducting its own experiments or model training.
Software Dependencies: Yes — We leverage both a closed-source LLM (gpt-4o-mini-0718 (OpenAI, 2024)) and open-source LLMs (Qwen-2.5-72b-instruct (Yang et al., 2024) and llama-3.1-70b (Dubey et al., 2024)). All models are accessed via APIs with the temperature set to 1. (...) The prompt for generating the operator profile is as follows: (...) Python (...) lightweight text embedding model (in our case, MiniLM (Wang et al., 2020)) and SentenceBERT (Reimers, 2019)
Experiment Setup: Yes — We set the number of layers as L = 4, the cost penalty coefficient λ ∈ {1e-3, 5e-3, 1e-2}, the sampling times K = 4, and thres = 0.3 for Equation (9).
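The reported hyperparameters can be collected into a single configuration object, which is how one might wire them into a reproduction. The class and field names below are hypothetical (the official repository may organize these differently); only the values come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MaASConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    num_layers: int = 4                              # supernet depth, L = 4
    cost_lambda_grid: tuple = (1e-3, 5e-3, 1e-2)     # cost penalty coefficient λ sweep
    sample_times: int = 4                            # sampling times, K = 4
    threshold: float = 0.3                           # thres for Equation (9)

cfg = MaASConfig()
```

Freezing the dataclass keeps the configuration immutable across a run, so a λ sweep would instantiate one config per value in `cost_lambda_grid` rather than mutating a shared object.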