G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks
Authors: Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, Tianlong Chen, Dawei Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on six benchmarks showcase that G-Designer is: (1) high-performing, achieving superior results on MMLU with accuracy at 84.50% and on HumanEval with pass@1 at 89.90%; (2) task-adaptive, architecting communication protocols tailored to task difficulty, reducing token consumption by up to 95.33% on HumanEval; and (3) adversarially robust, defending against agent adversarial attacks with merely 0.3% accuracy drop. |
| Researcher Affiliation | Academia | 1Tongji University 2NUS 3CUHK 4UCLA 5USTC 6NTU 7UNC-Chapel Hill. Correspondence to: Kun Wang <EMAIL>, Dawei Cheng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Designing workflow of G-Designer |
| Open Source Code | Yes | The code is available at https://github.com/yanweiyue/GDesigner. |
| Open Datasets | Yes | We evaluate G-Designer on three categories of datasets: General Reasoning: MMLU (Hendrycks et al., 2021); Mathematical Reasoning: GSM8K (Cobbe et al., 2021), MultiArith (Roy & Roth, 2016), SVAMP (Patel et al., 2021), and AQuA (Ling et al., 2017); Code: HumanEval (Chen et al., 2021). We include the dataset statistics in Table 4. |
| Dataset Splits | Yes | Given a benchmark {Q_i}^D_{i=1} consisting of D queries, G-Designer begins by optimizing with a small subset of B queries and fixes the learned parameters for testing on the remaining (D − B) queries. [...] For all benchmarks, we merely use B ∈ {40, 80} queries for optimization. |
| Hardware Specification | No | The paper mentions 'GPU cost' in Table 5 but does not provide specific details about the type of GPU, CPU, or other hardware used for running their experiments. It only refers to accessing 'GPT via the OpenAI API', which is an external service. |
| Software Dependencies | Yes | We access the GPT via the OpenAI API, and mainly test on gpt-4-1106-preview (gpt-4) and gpt-3.5-turbo-0125 (gpt-3.5). We set temperature to 0 for the single execution and single agent baselines and 1 for multi-agent methods. We set a summarizer agent to aggregate the dialogue history and produce the final solution a^(K), with K = 3 across all experiments. The Node Encoder(·) is implemented using all-MiniLM-L6-v2 (Wang et al., 2020), with the embedding dimension set to D = 384. |
| Experiment Setup | Yes | We set temperature to 0 for the single execution and single agent baselines and 1 for multi-agent methods. We set a summarizer agent to aggregate the dialogue history and produce the final solution a^(K), with K = 3 across all experiments. The Node Encoder(·) is implemented using all-MiniLM-L6-v2 (Wang et al., 2020), with the embedding dimension set to D = 384. The anchor topology Aanchor is predefined as a simple chain structure. The sampling times M are set as 10, and τ = 1e-2 and ζ = 1e-1 are set for all experiments. We provide explicit agent profiling for multi-agent methods, following the classical configurations in LLM-MA systems (Liu et al., 2023; Zhuge et al., 2024; Yin et al., 2023), and use gpt-4 to generate agent profile pools. For all benchmarks, we merely use B ∈ {40, 80} queries for optimization. |
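
The experiment setup predefines the anchor topology Aanchor as a simple chain over the agents. A minimal sketch of what such a chain adjacency matrix looks like is below; this is an illustrative reconstruction for the reproducibility reader, not the authors' code, and the function name `chain_anchor` is an assumption.

```python
import numpy as np

def chain_anchor(n_agents: int) -> np.ndarray:
    """Directed chain adjacency matrix: agent i passes its message to agent i+1.

    Illustrative sketch of the paper's 'simple chain structure' anchor
    topology; the actual G-Designer implementation may differ.
    """
    A = np.zeros((n_agents, n_agents))
    for i in range(n_agents - 1):
        A[i, i + 1] = 1.0  # single edge from agent i to its successor
    return A

# A 4-agent chain: 0 -> 1 -> 2 -> 3 (three edges, no back-edges)
A_anchor = chain_anchor(4)
```

G-Designer then learns a task-adaptive communication graph around this fixed anchor, so the chain serves only as a structural prior rather than the final topology.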