Self-Evolving Multi-Agent Collaboration Networks for Software Development

Authors: Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, Siheng Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that: i) The automatic requirement-aware evaluation in rSDE-Bench closely aligns with human evaluations, validating its reliability as a software-level coding benchmark. ii) EvoMAC outperforms previous SOTA methods on both the software-level rSDE-Bench and the function-level HumanEval benchmarks, reflecting its superior coding capabilities.
Researcher Affiliation Academia Yue Hu1, Yuzhu Cai2,3, Yaxin Du1, Xinyu Zhu1, Xiangrui Liu1, Zijie Yu1, Yuchen Hou1, Shuo Tang1, Siheng Chen1,3 1 Shanghai Jiao Tong University, 2 Beihang University, 3 Shanghai AI Laboratory 1 EMAIL, 2 EMAIL
Pseudocode Yes The overall algorithm is given in Alg. 1 in the appendix. Algorithm 1 Self-Evolving Paradigm
Open Source Code No The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/. This link is for the benchmark (data), not the source code for the EvoMAC methodology described in the paper. No explicit statement about releasing the code for EvoMAC is found.
Open Datasets Yes To support the development of software-level coding capabilities, we propose rSDE-Bench, a novel requirement-oriented software development benchmark... The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/. Our experiments cover both the proposed rSDE-Bench and the standard coding benchmark HumanEval (Chen et al., 2021).
Dataset Splits No rSDE-Bench involves 53 unique coding tasks and 616 test cases. rSDE-Bench introduces two requirement difficulty levels, including basic and advanced... HumanEval comprises 164 Python function completion problems. The paper describes the structure of the datasets and the evaluation metrics, but both benchmarks are used solely for evaluation; no train/validation/test partitioning of the 53 tasks or 616 test cases is specified.
Hardware Specification No The paper mentions using specific LLMs such as "GPT-4o-Mini" and "Claude-3.5-Sonnet" to power the agents, but it does not specify any hardware details such as GPU models (e.g., NVIDIA A100), CPU types, or other computing infrastructure used to run the experiments.
Software Dependencies No The paper states that models are powered by "GPT-4o-Mini" and "Claude-3.5-Sonnet" and refers to libraries such as `pygame`, `json`, `time`, `unittest`, `selenium`, and `webdriver.Chrome` in the context of generated code and test cases. However, it does not provide specific version numbers for these software dependencies (e.g., Python 3.x, Pygame 2.x, Selenium 4.x).
Experiment Setup Yes EvoMAC with varying evolving times and two different driving LLMs. The results indicate that EvoMAC consistently improves with more evolving times and shows convincing enhancements regardless of the driving LLM used, further demonstrating the effectiveness of our self-evolving design. Algorithm 1 Self-Evolving Paradigm ... Require: K as the number of self-evolving iterations
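To make the quoted setup concrete, the self-evolving paradigm of Algorithm 1 can be sketched as a generate-test-update loop bounded by K iterations. This is a minimal illustrative sketch, not the paper's implementation: the function names (`generate_code`, `run_tests`, `self_evolve`) and the stubbed agent behavior are assumptions introduced here for clarity.

```python
# Hypothetical sketch of Algorithm 1 (Self-Evolving Paradigm).
# The coding and testing agents are stubbed; in EvoMAC they would be
# LLM-powered agent networks (e.g., GPT-4o-Mini or Claude-3.5-Sonnet).

def generate_code(requirement, feedback=None):
    # Stand-in for the coding agent network: returns a candidate solution.
    # Improvement is simulated by counting accumulated feedback items.
    n_addressed = len(feedback) if feedback else 0
    return {"passes": n_addressed >= 2}

def run_tests(code):
    # Stand-in for the testing agent network: returns failing-case feedback.
    return [] if code["passes"] else ["failing case"]

def self_evolve(requirement, K=3):
    """Iterate generate -> test -> update for at most K evolving rounds."""
    feedback = []
    code = generate_code(requirement)
    for _ in range(K):
        failures = run_tests(code)
        if not failures:
            break  # all test cases pass; stop evolving early
        feedback.extend(failures)  # textual feedback from failed tests
        code = generate_code(requirement, feedback)  # updated agent output
    return code, len(feedback)

code, rounds = self_evolve("build a calculator app", K=3)
```

Under this toy simulation, the loop converges after two rounds of feedback, mirroring the paper's observation that EvoMAC improves with more evolving iterations.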