SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
Authors: Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased inference-time compute through deeper search, providing a pathway to improve software agents without requiring larger models or additional training data. |
| Researcher Affiliation | Collaboration | 1University of California, Santa Barbara, 2Moatless AI, 3Carnegie Mellon University, 4National University of Singapore, 5Mila |
| Pseudocode | No | The paper describes the search algorithm, including the UCT function in equation (3), and illustrates agent actions with pseudocode examples in Figure 2. However, it does not contain a dedicated, clearly labeled 'Pseudocode' or 'Algorithm' block for its main methodology. |
| Open Source Code | Yes | Code: github.com/aorwall/moatless-tree-search |
| Open Datasets | Yes | Benchmark For our experiments, we utilize SWE-bench Lite, a curated subset of the official SWE-bench, containing 300 instances. This dataset is specifically designed to be self-contained and focuses primarily on evaluating functional bug fixes, providing a controlled environment to assess the performance of our system. |
| Dataset Splits | No | The paper states using "SWE-bench Lite, a curated subset of the official SWE-bench, containing 300 instances" for experiments. It mentions applying conservative parameters across "the 300 instances" but does not explicitly provide information on how these 300 instances were split into training, validation, or test sets for the agent's learning or evaluation process. |
| Hardware Specification | No | The paper mentions using "Docker images built via the SWE-bench library" and running them "as pods in a Kubernetes cluster" for tests, and provides an API cost comparison. However, it does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for running the experiments or training the models. |
| Software Dependencies | Yes | For comparison, we build upon the moatless-tools framework (Orwall, 2024), a high-performing open-source agent commonly used in research settings (Chowdhury et al., 2024). To isolate the impact of our search approach, we adapt moatless-tools v0.0.2 as our baseline, referred to as Moatless-Adapted. |
| Experiment Setup | Yes | Implementation Details: For consistency, we use identical prompts across all models. In SWE-Search, we limit each node to a maximum of three expansions and cap the total search iterations at 100. Further details on model hyperparameters can be found in Appendix 2. |
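The Pseudocode row notes that the paper's equation (3) defines a UCT function for node selection but gives no algorithm block. For orientation, the sketch below shows the textbook UCT selection rule; it is an assumption-labeled illustration, not the paper's exact equation (3), which may include additional terms (the function and variable names here are hypothetical).

```python
import math

def uct_score(node_value: float, node_visits: int, parent_visits: int,
              c: float = 1.41) -> float:
    """Textbook UCT: mean value (exploitation) plus exploration bonus.

    NOTE: illustrative only; the paper's equation (3) may differ.
    """
    if node_visits == 0:
        return float("inf")  # unvisited children are expanded first
    exploitation = node_value / node_visits
    exploration = c * math.sqrt(math.log(parent_visits) / node_visits)
    return exploitation + exploration

def select_child(children):
    """Return the index of the child maximizing UCT.

    `children` is a list of (total_value, visit_count) pairs.
    """
    parent_visits = sum(visits for _, visits in children)
    scores = [uct_score(value, visits, parent_visits)
              for value, visits in children]
    return scores.index(max(scores))
```

In an MCTS loop this selection step would be applied at each tree level during descent, with the paper additionally capping each node at three expansions and the whole search at 100 iterations.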