Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
Authors: Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, Tianlong Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across six benchmarks demonstrate that AgentPrune (I) achieves comparable results to state-of-the-art topologies at merely $5.6 cost compared to their $43.7, (II) integrates seamlessly into existing multi-agent frameworks with 28.1%∼72.8% token reduction, and (III) successfully defends against two types of agent-based adversarial attacks with a 3.5%∼10.8% performance boost. |
| Researcher Affiliation | Academia | Guibin Zhang¹, Yanwei Yue¹, Zhixun Li², Sukwon Yun³, Guancheng Wan⁴, Kun Wang⁵, Dawei Cheng¹﹐⁶, Jeffrey Xu Yu², Tianlong Chen³ — ¹Tongji University, ²The Chinese University of Hong Kong, ³University of North Carolina at Chapel Hill, ⁴Wuhan University, ⁵Nanyang Technological University, ⁶Shanghai AI Laboratory |
| Pseudocode | Yes | Algorithm 1: Execution pipeline of LLM-MA systems from spatial-temporal graph perspective |
| Open Source Code | Yes | The source code is available at https://github.com/yanweiyue/AgentPrune. |
| Open Datasets | Yes | In our experiments, we test the performance of AgentPrune on three types of reasoning tasks and the corresponding logically challenging benchmarks: (1) General Reasoning: We opt for the MMLU (Hendrycks et al., 2021) dataset; (2) Mathematical Reasoning: We select GSM8K (Cobbe et al., 2021), MultiArith (Roy & Roth, 2016), SVAMP (Patel et al., 2021) and AQuA (Ling et al., 2017) to verify mathematical reasoning capacity; (3) Code Generation: We use HumanEval (Chen et al., 2021a) to test function-level code generation ability. |
| Dataset Splits | Yes | For multi-query settings, we vary Q ∈ {5, 10, 20} and fix M = 10. Given a benchmark consisting of Q queries, any LLM-MA framework processes these Q queries sequentially to provide solutions one by one. We utilize the initial Q′ (Q′ ≪ Q) queries as a training phase, collaboratively optimizing the spatio-temporal communication topology while leveraging multiple agents for reasoning and evaluation. Following this, we perform one-shot pruning as described in Equation (12). The fixed topology G_sub is then employed for the reasoning and evaluation of the remaining (Q − Q′) queries. |
| Hardware Specification | No | We accessed the GPT models via the OpenAI API, and mainly tested on gpt-3.5-turbo-0301 (gpt-3.5) and gpt-4-1106-preview (gpt-4). |
| Software Dependencies | Yes | We accessed the GPT models via the OpenAI API, and mainly tested on gpt-3.5-turbo-0301 (gpt-3.5) and gpt-4-1106-preview (gpt-4). |
| Experiment Setup | Yes | We set the temperature to 1 during generation. We set the dialogue round K = 2 for mathematical and general reasoning tasks, and K = 4 for code generation tasks. For multi-query settings, we vary Q ∈ {5, 10, 20} and fix M = 10. We generate different agent profiles using gpt-4. The pruning ratio is chosen among {50%, 30%}. More experimental details are in Appendix G.2. Initialization of graph masks: The graph masks S = {Sˢ, Sᵀ} are initialized as 0.5 · 1|V|, where 1 is an all-ones matrix. |
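The setup rows above describe two mechanical steps: initializing the spatial and temporal graph masks to 0.5, and one-shot pruning that keeps only the highest-weight edges at a chosen ratio. The following is a minimal NumPy sketch of those two steps under stated assumptions: the function names `init_masks` and `one_shot_prune` are hypothetical (not from the paper's codebase), masks are plain |V|×|V| arrays rather than trainable parameters, and pruning drops the lowest-weight fraction of edges by a simple threshold.

```python
import numpy as np

def init_masks(num_agents: int):
    # Hypothetical helper: spatial mask S^S and temporal mask S^T,
    # each initialized as 0.5 * 1_{|V|} per the setup above.
    spatial = np.full((num_agents, num_agents), 0.5)
    temporal = np.full((num_agents, num_agents), 0.5)
    return spatial, temporal

def one_shot_prune(mask: np.ndarray, ratio: float) -> np.ndarray:
    # Hypothetical sketch of one-shot pruning: drop the lowest-weight
    # `ratio` fraction of edges (ratio in {0.5, 0.3} per the setup)
    # and binarize the survivors into a fixed communication topology.
    flat = mask.flatten()
    k = int(round(len(flat) * ratio))  # number of edges to drop
    if k == 0:
        return (mask > 0).astype(float)
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(mask > threshold, 1.0, 0.0)
```

After the training phase on the first Q′ queries has differentiated the mask weights, the pruned binary mask would serve as the fixed topology G_sub for the remaining queries; with ties at the threshold, this sketch may drop slightly more edges than the nominal ratio.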