Scaling Large Language Model-based Multi-Agent Collaboration
Authors: Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations reveal that it effectively supports collaboration among over a thousand agents, with irregular topologies outperforming regular ones. We also identify a collaborative scaling law: the overall performance follows a logistic growth pattern as agents scale, with collaborative emergence occurring earlier than traditional neural emergence. ... We performed extensive evaluations across different downstream scenarios, employing three types of representative topologies (chain, tree, and graph), further divided into six representative variants. The results show that MACNET surpasses all baselines on average and supports effective collaboration among over a thousand agents. |
| Researcher Affiliation | Academia | Tsinghua University; Peng Cheng Laboratory |
| Pseudocode | No | The paper describes the methodology using natural language and mathematical equations, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/OpenBMB/ChatDev/tree/macnet. |
| Open Datasets | Yes | Datasets and Metrics: We adopt publicly available and logically challenging benchmarks to evaluate performance across heterogeneous downstream scenarios. MMLU (Hendrycks et al., 2021) provides a comprehensive set of logical reasoning assessments across diverse subjects and difficulties... HumanEval (Chen et al., 2021), a widely recognized benchmark for function-level code generation... SRDD (Qian et al., 2024c) integrates complex textual software requirements... CommonGen-Hard (Madaan et al., 2023) tests the ability to generate coherent sentences with discrete concepts... |
| Dataset Splits | No | The paper mentions evaluating on several publicly available benchmarks (MMLU, HumanEval, SRDD, CommonGen-Hard) and assessing HumanEval via 'pass@k, which reflects function correctness across multiple standard test cases.' However, it does not provide specific percentages, sample counts, or explicit methodology for how these datasets were split into training, validation, or test sets for their experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU specifications, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions that 'GPT-3.5 is employed for interactive reasoning,' but it does not specify any other software dependencies, libraries, or frameworks with their respective version numbers that were used to conduct the experiments. |
| Experiment Setup | Yes | By default, we employ a topology consisting of approximately four nodes, aligning with multi-agent baselines. GPT-3.5 is employed for interactive reasoning due to its optimal balance of efficacy and efficiency, with each iterative interaction limited to three exchange rounds. |
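The Dataset Splits row above notes that HumanEval is scored via pass@k. For reference, the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021) can be sketched as follows; this is a generic illustration, not code from the reviewed paper:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Given n generated samples for a problem, of which c pass the
    unit tests, returns the probability that at least one of k
    samples drawn without replacement is correct:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k), numerically stable.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 2 samples, 1 correct, 1 draw -> 0.5
print(pass_at_k(2, 1, 1))
```

The product form avoids computing large binomial coefficients directly, which matters when n is in the hundreds.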