Self-Evolving Multi-Agent Collaboration Networks for Software Development
Authors: Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, Siheng Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that: i) The automatic requirement-aware evaluation in rSDE-Bench closely aligns with human evaluations, validating its reliability as a software-level coding benchmark. ii) EvoMAC outperforms previous SOTA methods on both the software-level rSDE-Bench and the function-level HumanEval benchmarks, reflecting its superior coding capabilities. |
| Researcher Affiliation | Academia | Yue Hu (1), Yuzhu Cai (2,3), Yaxin Du (1), Xinyu Zhu (1), Xiangrui Liu (1), Zijie Yu (1), Yuchen Hou (1), Shuo Tang (1), Siheng Chen (1,3); 1 Shanghai Jiao Tong University, 2 Beihang University, 3 Shanghai AI Laboratory; 1 EMAIL, 2 EMAIL |
| Pseudocode | Yes | The overall algorithm is given as Alg. 1 in the appendix: "Algorithm 1 Self-Evolving Paradigm". |
| Open Source Code | No | The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/. This link is for the benchmark (data), not the source code for the EvoMAC methodology described in the paper. No explicit statement about releasing the code for EvoMAC is found. |
| Open Datasets | Yes | To support the development of software-level coding capabilities, we propose rSDE-Bench, a novel requirement-oriented software development benchmark... The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/. Our experiments cover both the proposed rSDE-Bench and the standard coding benchmark HumanEval (Chen et al., 2021). |
| Dataset Splits | No | rSDE-Bench involves 53 unique coding tasks and 616 test cases. rSDE-Bench introduces two requirement difficulty levels, basic and advanced... HumanEval comprises 164 Python function completion problems. The paper describes the structure of the datasets and the evaluation metrics, but it does not specify train/validation/test splits: the 53 tasks and 616 test cases are used only for evaluation and are not partitioned into standard experimental splits. |
| Hardware Specification | No | The paper mentions using specific LLMs like "GPT-4o-Mini" and "Claude-3.5-Sonnet" for powering the models, but it does not specify any hardware details like GPU models (e.g., NVIDIA A100), CPU types, or other computing infrastructure used for running the experiments. |
| Software Dependencies | No | The paper states that models are powered by "GPT-4o-Mini" and "Claude-3.5-Sonnet" and refers to libraries such as `pygame`, `json`, `time`, `unittest`, `selenium`, and `webdriver.Chrome` in the context of generated code and test cases. However, it does not provide specific version numbers for these software dependencies (e.g., Python 3.x, Pygame 2.x, Selenium 4.x). |
| Experiment Setup | Yes | EvoMAC with varying evolving times and two different driving LLMs. The results indicate that EvoMAC consistently improves with more evolving times and shows convincing enhancements regardless of the driving LLM used, further demonstrating the effectiveness of our self-evolving design. Algorithm 1 Self-Evolving Paradigm ... Require: K as the number of self-evolving iterations |
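The self-evolving paradigm the table quotes (Algorithm 1, with K self-evolving iterations) can be sketched as a test-driven refinement loop: generate code with an agent network, collect test feedback, and update the network until the tests pass or the iteration budget K is exhausted. This is a minimal toy sketch of that loop; the names (`ToyNetwork`, `run_tests`, `update_network`, the `bugs` counter) are illustrative stand-ins, not the authors' actual EvoMAC API.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Environmental feedback from running tests (failure list only, for the toy)."""
    failures: list = field(default_factory=list)


class ToyNetwork:
    """Stand-in for a multi-agent coding network; `bugs` models residual defects."""
    def __init__(self, bugs: int = 2):
        self.bugs = bugs

    def generate(self, requirement: str) -> dict:
        # A real network would emit software; here we just record its state.
        return {"requirement": requirement, "bugs": self.bugs}


def run_tests(code: dict, requirement: str) -> Feedback:
    # Stand-in for requirement-aware test execution: one failure per residual bug.
    return Feedback(failures=[f"bug{i}" for i in range(code["bugs"])])


def update_network(network: ToyNetwork, feedback: Feedback) -> ToyNetwork:
    # Stand-in for the textual update step: each round repairs one defect.
    network.bugs = max(0, network.bugs - 1)
    return network


def self_evolve(requirement: str, network: ToyNetwork, K: int = 3) -> dict:
    """K rounds of generate -> test -> update, stopping early once tests pass."""
    code = network.generate(requirement)
    for _ in range(K):
        feedback = run_tests(code, requirement)
        if not feedback.failures:  # all tests pass: stop early
            break
        network = update_network(network, feedback)
        code = network.generate(requirement)
    return code


result = self_evolve("build a calculator app", ToyNetwork(bugs=2), K=3)
```

With two initial defects and K=3, the loop converges within the budget; in the paper's setting the feedback would come from executing the benchmark's test cases rather than a counter.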