Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
Authors: Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Ramapura Narasimha Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, a 25% relative improvement that beats most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges. Section 4 is explicitly titled "EXPERIMENTS". |
| Researcher Affiliation | Collaboration | Kexun Zhang1,2, Weiran Yao1, Zuxin Liu1, Yihao Feng1, Zhiwei Liu1, Rithesh Murthy1, Tian Lan1, Lei Li2, Renze Lou1, Jiacheng Xu1, Bo Pang1, Yingbo Zhou1, Shelby Heinecke1, Silvio Savarese1, Huan Wang1, Caiming Xiong1 1Salesforce AI Research, 2Carnegie Mellon University |
| Pseudocode | No | The paper describes the framework (DEI) and its implementation (DEIBASE) in prose within Section 3.3.2 and 3.3.3, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code, data, and generations are released at https://github.com/SalesforceAIResearch/swecomm. |
| Open Datasets | Yes | We conduct our experiments on SWE-Bench Lite, a 300-instance subset sampled from the full SWE-Bench for providing a more self-contained evaluation of functional bug fixes (Jimenez et al., 2024). Compared to the full SWE-Bench, SWE-Bench Lite has significantly more submissions on the leaderboard for us to conduct a more comprehensive analysis of inter-agent diversity. |
| Dataset Splits | Yes | We trained it on a randomly sampled subset with 150 issues in SWE-Bench and evaluated it on the remaining 150 issues. |
| Hardware Specification | No | The paper discusses the use of Large Language Models (LLMs) such as gpt4o and Claude 3.5 Sonnet, but it does not specify any hardware details (e.g., GPU models, CPU types, memory) used to run their experiments or framework. |
| Software Dependencies | No | The paper mentions specific Large Language Models (LLMs) like "gpt4o" and "Claude 3.5 Sonnet" as components of the agents and the DEIBASE committee. However, it does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages like Python, or libraries such as PyTorch, TensorFlow, scikit-learn) used in their implementation. |
| Experiment Setup | Yes | To encourage diverse outputs, we use an inference temperature of 1.2. In most DEIBASE experiments, we allow 10 votes for each candidate patch. We sampled 10 issues that were not solved by Agentless and got low scores from DEI. We modified the bug-fixing part of the Agentless framework to include DEI's output and refined the patches for at most 5 rounds. |
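The setup row above mentions that DEIBASE allows 10 votes per candidate patch and selects among candidates by score. A minimal sketch of that selection step, assuming votes are numeric scores aggregated by mean (function and patch names here are illustrative, not taken from the paper's released code):

```python
# Hypothetical sketch of DEI-style candidate selection: each committee
# "vote" assigns a score to a candidate patch, and the patch with the
# highest mean score across its votes is chosen.

def select_patch(candidate_scores):
    """candidate_scores maps patch id -> list of scores (one per vote)."""
    def mean(scores):
        return sum(scores) / len(scores)
    # Pick the patch whose votes have the highest average score.
    return max(candidate_scores, key=lambda pid: mean(candidate_scores[pid]))


# Example: three candidate patches, each scored by 10 votes
# (matching the 10-votes-per-patch setting reported above).
scores = {
    "patch_a": [7, 8, 6, 7, 9, 8, 7, 6, 8, 7],
    "patch_b": [5, 4, 6, 5, 5, 4, 6, 5, 5, 4],
    "patch_c": [9, 8, 9, 9, 8, 9, 8, 9, 9, 8],
}
print(select_patch(scores))  # patch_c has the highest mean score
```

The aggregation rule (mean vs. sum vs. majority) is an assumption for illustration; the paper's DEIBASE implementation should be consulted for the exact scoring procedure.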