Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
Authors: Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Ramapura Narasimha Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, a 25% relative improvement that beats most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges. Section 4 is explicitly titled "EXPERIMENTS". |
| Researcher Affiliation | Collaboration | Kexun Zhang1,2, Weiran Yao1, Zuxin Liu1, Yihao Feng1, Zhiwei Liu1, Rithesh Murthy1, Tian Lan1, Lei Li2, Renze Lou1, Jiacheng Xu1, Bo Pang1, Yingbo Zhou1, Shelby Heinecke1, Silvio Savarese1, Huan Wang1, Caiming Xiong1 1Salesforce AI Research, 2Carnegie Mellon University |
| Pseudocode | No | The paper describes the framework (DEI) and its implementation (DEIBASE) in prose within Section 3.3.2 and 3.3.3, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code, data, and generations are released at https://github.com/SalesforceAIResearch/swecomm. |
| Open Datasets | Yes | We conduct our experiments on SWE-Bench Lite, a 300-instance subset sampled from the full SWE-Bench for providing a more self-contained evaluation of functional bug fixes (Jimenez et al., 2024). Compared to the full SWE-Bench, SWE-Bench Lite has significantly more submissions on the leaderboard for us to conduct a more comprehensive analysis of inter-agent diversity. |
| Dataset Splits | Yes | We trained it on a randomly sampled subset with 150 issues in SWE-Bench and evaluated it on the remaining 150 issues. |
| Hardware Specification | No | The paper discusses the use of Large Language Models (LLMs) such as gpt4o and Claude 3.5 Sonnet, but it does not specify any hardware details (e.g., GPU models, CPU types, memory) used to run their experiments or framework. |
| Software Dependencies | No | The paper mentions specific Large Language Models (LLMs) like "gpt4o" and "Claude 3.5 Sonnet" as components of the agents and the DEIBASE committee. However, it does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages like Python, or libraries such as PyTorch, TensorFlow, scikit-learn) used in their implementation. |
| Experiment Setup | Yes | To encourage diverse outputs, we use an inference temperature of 1.2. In most DEIBASE experiments, we allow 10 votes for each candidate patch. We sampled 10 issues that were not solved by Agentless and got low scores from DEI. We modified the bug-fixing part of the Agentless framework to include DEI's output and refined the patches for at most 5 rounds. |
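The setup row above mentions that DEIBASE allows 10 votes per candidate patch and selects among candidates by score. A minimal sketch of that selection step, assuming votes are numeric scores aggregated by mean (function and patch names here are illustrative, not taken from the paper's released code):

```python
# Hypothetical sketch of DEI-style candidate selection: each committee
# "vote" assigns a score to a candidate patch, and the patch with the
# highest mean score across its votes is chosen.

def select_patch(candidate_scores):
    """candidate_scores maps patch id -> list of scores (one per vote)."""
    def mean(scores):
        return sum(scores) / len(scores)
    # Pick the patch whose votes have the highest average score.
    return max(candidate_scores, key=lambda pid: mean(candidate_scores[pid]))


# Example: three candidate patches, each scored by 10 votes
# (matching the 10-votes-per-patch setting reported above).
scores = {
    "patch_a": [7, 8, 6, 7, 9, 8, 7, 6, 8, 7],
    "patch_b": [5, 4, 6, 5, 5, 4, 6, 5, 5, 4],
    "patch_c": [9, 8, 9, 9, 8, 9, 8, 9, 9, 8],
}
print(select_patch(scores))  # patch_c has the highest mean score
```

The aggregation rule (mean vs. sum vs. majority) is an assumption for illustration; the paper's DEIBASE implementation should be consulted for the exact scoring procedure.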