Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
Authors: Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, Tianlong Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across six benchmarks demonstrate that AgentPrune (I) achieves comparable results to state-of-the-art topologies at merely $5.6 cost compared to their $43.7, (II) integrates seamlessly into existing multi-agent frameworks with 28.1%∼72.8% token reduction, and (III) successfully defends against two types of agent-based adversarial attacks with a 3.5%∼10.8% performance boost. |
| Researcher Affiliation | Academia | Guibin Zhang¹, Yanwei Yue¹, Zhixun Li², Sukwon Yun³, Guancheng Wan⁴, Kun Wang⁵, Dawei Cheng¹﹐⁶, Jeffrey Xu Yu², Tianlong Chen³ — ¹Tongji University, ²The Chinese University of Hong Kong, ³University of North Carolina at Chapel Hill, ⁴Wuhan University, ⁵Nanyang Technological University, ⁶Shanghai AI Laboratory |
| Pseudocode | Yes | Algorithm 1: Execution pipeline of LLM-MA systems from spatial-temporal graph perspective |
| Open Source Code | Yes | The source code is available at https://github.com/yanweiyue/AgentPrune. |
| Open Datasets | Yes | In our experiments, we test the performance of AgentPrune on three types of reasoning tasks and the corresponding logically challenging benchmarks: (1) General Reasoning: We opt for the MMLU (Hendrycks et al., 2021) dataset; (2) Mathematical Reasoning: We select GSM8K (Cobbe et al., 2021), MultiArith (Roy & Roth, 2016), SVAMP (Patel et al., 2021) and AQuA (Ling et al., 2017) to verify mathematical reasoning capacity; (3) Code Generation: We use HumanEval (Chen et al., 2021a) to test function-level code generation ability. |
| Dataset Splits | Yes | For multi-query settings, we vary Q ∈ {5, 10, 20} and fix M = 10. Given a benchmark consisting of Q queries, any LLM-MA framework processes these Q queries sequentially to provide solutions one by one. We utilize the initial Q′ (Q′ ≪ Q) queries as a training phase, collaboratively optimizing the spatio-temporal communication topology while leveraging multiple agents for reasoning and evaluation. Following this, we perform one-shot pruning as described in Equation (12). The fixed topology G_sub is then employed for the reasoning and evaluation of the remaining (Q − Q′) queries. |
| Hardware Specification | No | We accessed the GPT models via the OpenAI API, and mainly tested on gpt-3.5-turbo-0301 (gpt-3.5) and gpt-4-1106-preview (gpt-4). |
| Software Dependencies | Yes | We accessed the GPT models via the OpenAI API, and mainly tested on gpt-3.5-turbo-0301 (gpt-3.5) and gpt-4-1106-preview (gpt-4). |
| Experiment Setup | Yes | We set the temperature to 1 during generation. We set the dialogue round K = 2 for mathematical and general reasoning tasks, and K = 4 for code generation tasks. For multi-query settings, we vary Q ∈ {5, 10, 20} and fix M = 10. We generate different agent profiles using gpt-4. The pruning ratio is chosen among {50%, 30%}. More experimental details are in Appendix G.2. Initialization of graph masks: The graph masks S = {Sˢ, Sᵀ} are initialized as 0.5 · 1|V|, where 1 is an all-ones matrix. |
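The setup rows above describe two mechanical steps: initializing the spatial and temporal graph masks to 0.5, and one-shot pruning that keeps only the highest-weight edges at a chosen ratio. The following is a minimal NumPy sketch of those two steps under stated assumptions: the function names `init_masks` and `one_shot_prune` are hypothetical (not from the paper's codebase), masks are plain |V|×|V| arrays rather than trainable parameters, and pruning drops the lowest-weight fraction of edges by a simple threshold.

```python
import numpy as np

def init_masks(num_agents: int):
    # Hypothetical helper: spatial mask S^S and temporal mask S^T,
    # each initialized as 0.5 * 1_{|V|} per the setup above.
    spatial = np.full((num_agents, num_agents), 0.5)
    temporal = np.full((num_agents, num_agents), 0.5)
    return spatial, temporal

def one_shot_prune(mask: np.ndarray, ratio: float) -> np.ndarray:
    # Hypothetical sketch of one-shot pruning: drop the lowest-weight
    # `ratio` fraction of edges (ratio in {0.5, 0.3} per the setup)
    # and binarize the survivors into a fixed communication topology.
    flat = mask.flatten()
    k = int(round(len(flat) * ratio))  # number of edges to drop
    if k == 0:
        return (mask > 0).astype(float)
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(mask > threshold, 1.0, 0.0)
```

After the training phase on the first Q′ queries has differentiated the mask weights, the pruned binary mask would serve as the fixed topology G_sub for the remaining queries; with ties at the threshold, this sketch may drop slightly more edges than the nominal ratio.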