SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
Authors: Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased inference-time compute through deeper search, providing a pathway to improve software agents without requiring larger models or additional training data. |
| Researcher Affiliation | Collaboration | 1University of California, Santa Barbara, 2Moatless AI, 3Carnegie Mellon University, 4National University of Singapore, 5Mila |
| Pseudocode | No | The paper describes the search algorithm, including the UCT function in equation (3), and illustrates agent actions with pseudocode examples in Figure 2. However, it does not contain a dedicated, clearly labeled 'Pseudocode' or 'Algorithm' block for its main methodology. |
| Open Source Code | Yes | Code: github.com/aorwall/moatless-tree-search |
| Open Datasets | Yes | Benchmark For our experiments, we utilize SWE-bench Lite, a curated subset of the official SWE-bench, containing 300 instances. This dataset is specifically designed to be self-contained and focuses primarily on evaluating functional bug fixes, providing a controlled environment to assess the performance of our system. |
| Dataset Splits | No | The paper states using "SWE-bench Lite, a curated subset of the official SWE-bench, containing 300 instances" for experiments. It mentions applying conservative parameters across "the 300 instances" but does not explicitly provide information on how these 300 instances were split into training, validation, or test sets for the agent's learning or evaluation process. |
| Hardware Specification | No | The paper mentions using "Docker images built via the SWE-bench library" and running them "as pods in a Kubernetes cluster" for tests, and provides an API cost comparison. However, it does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for running the experiments or training the models. |
| Software Dependencies | Yes | For comparison, we build upon the moatless-tools framework (Orwall, 2024), a high-performing open-source agent commonly used in research settings (Chowdhury et al., 2024). To isolate the impact of our search approach, we adapt moatless-tools v0.0.2 as our baseline, referred to as Moatless-Adapted. |
| Experiment Setup | Yes | Implementation Details: For consistency, we use identical prompts across all models. In SWE-Search, we limit each node to a maximum of three expansions and cap the total search iterations at 100. Further details on model hyperparameters can be found in Appendix 2. |
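The Pseudocode row notes that the paper's equation (3) defines a UCT function for node selection but gives no algorithm block. For orientation, the sketch below shows the textbook UCT selection rule; it is an assumption-labeled illustration, not the paper's exact equation (3), which may include additional terms (the function and variable names here are hypothetical).

```python
import math

def uct_score(node_value: float, node_visits: int, parent_visits: int,
              c: float = 1.41) -> float:
    """Textbook UCT: mean value (exploitation) plus exploration bonus.

    NOTE: illustrative only; the paper's equation (3) may differ.
    """
    if node_visits == 0:
        return float("inf")  # unvisited children are expanded first
    exploitation = node_value / node_visits
    exploration = c * math.sqrt(math.log(parent_visits) / node_visits)
    return exploitation + exploration

def select_child(children):
    """Return the index of the child maximizing UCT.

    `children` is a list of (total_value, visit_count) pairs.
    """
    parent_visits = sum(visits for _, visits in children)
    scores = [uct_score(value, visits, parent_visits)
              for value, visits in children]
    return scores.index(max(scores))
```

In an MCTS loop this selection step would be applied at each tree level during descent, with the paper additionally capping each node at three expansions and the whole search at 100 iterations.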