Scaling Large Language Model-based Multi-Agent Collaboration
Authors: Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations reveal that it effectively supports collaboration among over a thousand agents, with irregular topologies outperforming regular ones. We also identify a collaborative scaling law: the overall performance follows a logistic growth pattern as agents scale, with collaborative emergence occurring earlier than traditional neural emergence. ... We performed extensive evaluations across different downstream scenarios, employing three types of representative topologies (chain, tree, and graph), further divided into six representative variants. The results show that MACNET surpasses all baselines on average and supports effective collaboration among over a thousand agents. |
| Researcher Affiliation | Academia | Tsinghua University; Peng Cheng Laboratory |
| Pseudocode | No | The paper describes the methodology using natural language and mathematical equations, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/OpenBMB/ChatDev/tree/macnet. |
| Open Datasets | Yes | Datasets and Metrics: We adopt publicly available and logically challenging benchmarks to evaluate performance across heterogeneous downstream scenarios. MMLU (Hendrycks et al., 2021) provides a comprehensive set of logical reasoning assessments across diverse subjects and difficulties... HumanEval (Chen et al., 2021), a widely recognized benchmark for function-level code generation... SRDD (Qian et al., 2024c) integrates complex textual software requirements... CommonGen-Hard (Madaan et al., 2023) tests the ability to generate coherent sentences with discrete concepts... |
| Dataset Splits | No | The paper mentions evaluating on several publicly available benchmarks (MMLU, HumanEval, SRDD, CommonGen-Hard) and assessing HumanEval via 'pass@k, which reflects function correctness across multiple standard test cases.' However, it does not provide specific percentages, sample counts, or explicit methodology for how these datasets were split into training, validation, or test sets for their experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU specifications, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions that 'GPT-3.5 is employed for interactive reasoning,' but it does not specify any other software dependencies, libraries, or frameworks with their respective version numbers that were used to conduct the experiments. |
| Experiment Setup | Yes | By default, we employ a topology consisting of approximately four nodes, aligning with multi-agent baselines. GPT-3.5 is employed for interactive reasoning due to its optimal balance of efficacy and efficiency, with each iterative interaction limited to three exchange rounds. |
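The Dataset Splits row above notes that HumanEval is scored via pass@k. For reference, the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021) can be sketched as follows; this is a generic illustration, not code from the reviewed paper:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Given n generated samples for a problem, of which c pass the
    unit tests, returns the probability that at least one of k
    samples drawn without replacement is correct:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k), numerically stable.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 2 samples, 1 correct, 1 draw -> 0.5
print(pass_at_k(2, 1, 1))
```

The product form avoids computing large binomial coefficients directly, which matters when n is in the hundreds.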