Self-Evolving Multi-Agent Collaboration Networks for Software Development

Authors: Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, Siheng Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that: i) The automatic requirement-aware evaluation in rSDE-Bench closely aligns with human evaluations, validating its reliability as a software-level coding benchmark. ii) EvoMAC outperforms previous SOTA methods on both the software-level rSDE-Bench and the function-level HumanEval benchmarks, reflecting its superior coding capabilities.
Researcher Affiliation Academia Yue Hu1, Yuzhu Cai2,3, Yaxin Du1, Xinyu Zhu1, Xiangrui Liu1, Zijie Yu1, Yuchen Hou1, Shuo Tang1, Siheng Chen1,3 1 Shanghai Jiao Tong University, 2 Beihang University, 3 Shanghai AI Laboratory 1 EMAIL, 2 EMAIL
Pseudocode Yes The overall algorithm is given in Alg. 1 in the appendix. Algorithm 1 Self-Evolving Paradigm
Open Source Code No The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/. This link is for the benchmark (data), not the source code for the EvoMAC methodology described in the paper. No explicit statement about releasing the code for EvoMAC is found.
Open Datasets Yes To support the development of software-level coding capabilities, we propose rSDE-Bench, a novel requirement-oriented software development benchmark... The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/. Our experiments cover both the proposed rSDE-Bench and the standard coding benchmark HumanEval (Chen et al., 2021).
Dataset Splits No rSDE-Bench involves 53 unique coding tasks and 616 test cases. rSDE-Bench introduces two requirement difficulty levels, including basic and advanced... HumanEval comprises 164 Python function completion problems. The paper describes the structure of the datasets and the evaluation metrics, but both benchmarks are used solely for evaluation; no train/validation/test partitioning of the 53 tasks or 616 test cases is specified.
Hardware Specification No The paper mentions using specific LLMs such as "GPT-4o-Mini" and "Claude-3.5-Sonnet" to power the agents, but it does not specify any hardware details such as GPU models (e.g., NVIDIA A100), CPU types, or other computing infrastructure used to run the experiments.
Software Dependencies No The paper states that models are powered by "GPT-4o-Mini" and "Claude-3.5-Sonnet" and refers to libraries such as `pygame`, `json`, `time`, `unittest`, `selenium`, and `webdriver.Chrome` in the context of generated code and test cases. However, it does not provide specific version numbers for these software dependencies (e.g., Python 3.x, Pygame 2.x, Selenium 4.x).
Experiment Setup Yes EvoMAC with varying evolving times and two different driving LLMs. The results indicate that EvoMAC consistently improves with more evolving times and shows convincing enhancements regardless of the driving LLM used, further demonstrating the effectiveness of our self-evolving design. Algorithm 1 Self-Evolving Paradigm ... Require: K as the number of self-evolving iterations
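To make the quoted setup concrete, the self-evolving paradigm of Algorithm 1 can be sketched as a generate-test-update loop bounded by K iterations. This is a minimal illustrative sketch, not the paper's implementation: the function names (`generate_code`, `run_tests`, `self_evolve`) and the stubbed agent behavior are assumptions introduced here for clarity.

```python
# Hypothetical sketch of Algorithm 1 (Self-Evolving Paradigm).
# The coding and testing agents are stubbed; in EvoMAC they would be
# LLM-powered agent networks (e.g., GPT-4o-Mini or Claude-3.5-Sonnet).

def generate_code(requirement, feedback=None):
    # Stand-in for the coding agent network: returns a candidate solution.
    # Improvement is simulated by counting accumulated feedback items.
    n_addressed = len(feedback) if feedback else 0
    return {"passes": n_addressed >= 2}

def run_tests(code):
    # Stand-in for the testing agent network: returns failing-case feedback.
    return [] if code["passes"] else ["failing case"]

def self_evolve(requirement, K=3):
    """Iterate generate -> test -> update for at most K evolving rounds."""
    feedback = []
    code = generate_code(requirement)
    for _ in range(K):
        failures = run_tests(code)
        if not failures:
            break  # all test cases pass; stop evolving early
        feedback.extend(failures)  # textual feedback from failed tests
        code = generate_code(requirement, feedback)  # updated agent output
    return code, len(feedback)

code, rounds = self_evolve("build a calculator app", K=3)
```

Under this toy simulation, the loop converges after two rounds of feedback, mirroring the paper's observation that EvoMAC improves with more evolving iterations.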