Automated Benchmark Generation for Repository-Level Coding Tasks
Authors: Konstantinos Vergopoulos, Mark Niklas Mueller, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using SETUPAGENT, we generate two new datasets: (i) SWEE-Bench, an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench, a benchmark focusing on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code-agent performance, we find significant distributional differences, including lower issue-description quality and detail level, higher fix complexity, and, most importantly, up to 60% lower agent success rates. |
| Researcher Affiliation | Collaboration | 1Logic Star AI 2Department of Computer Science, ETH Zurich. Correspondence to: Mark Niklas Müller <EMAIL>. |
| Pseudocode | No | The paper describes methods through textual descriptions and illustrative figures (Figures 1, 2, 3, 4) showing input/output of LLM steps, but does not present formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | We publish SWA-Bench on Hugging Face and the corresponding Docker containers at logicstarai/swa-bench. A suitable evaluation harness is available at github.com/logic-star-ai/SWEBench. |
| Open Datasets | Yes | Using SETUPAGENT, we generate two new datasets: (i) SWEE-Bench, an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench, a benchmark focusing on applications rather than libraries. We publish SWA-Bench on Hugging Face and the corresponding Docker containers at logicstarai/swa-bench. |
| Dataset Splits | No | The paper states, 'We conduct all below experiments on the full SWA and uniformly subsampled versions of SWEE and SWE-Full of identical size (535 instances) due to cost constraints.' This describes an evaluation sampling strategy, but it does not specify training/validation/test splits or the details needed to reproduce the partitioning (e.g., the random seed or the exact selection criteria for the uniform subsampling). |
| Hardware Specification | No | The paper states that 'We run all code execution (both for SETUPAGENT and all Code Agents) in separate Docker containers to improve reproducibility and security' but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | Yes | For SETUPAGENT, we use an Ubuntu 22.04 container as the base image and pre-install a range of common build dependencies but do not provide any Python dependencies. SETUPAGENT enforces the use of the uv environment manager for Python dependencies. For exact versions, see Table 9 in App. A. Table 9 specifies model IDs such as 'gpt-4o-2024-08-06' for GPT-4O and 'claude-3-5-haiku-20241022' for HAIKU-3.5. |
| Experiment Setup | No | The paper states, 'For decoding, we use the default parameters for all Code Agents and greedy decoding for SETUPAGENT,' but does not provide specific hyperparameter values or detailed training configurations for these agents or for SETUPAGENT itself beyond these general statements. |
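The missing subsampling details noted under "Dataset Splits" could be fixed by a seeded uniform draw. Below is a minimal sketch of such a procedure; the seed value, pool size, and instance-ID format are illustrative assumptions, not details from the paper.

```python
import random

def uniform_subsample(instance_ids, k, seed=0):
    """Draw a reproducible uniform sample of k instance IDs.

    Sorting the pool first makes the result independent of input order;
    fixing the seed makes repeated runs identical.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(instance_ids), k))

# Illustrative use: subsample a hypothetical pool of benchmark instances
# down to 535, the evaluation size reported in the paper.
pool = [f"repo__issue-{i}" for i in range(2000)]
subset = uniform_subsample(pool, 535, seed=42)
assert len(subset) == 535
assert subset == uniform_subsample(pool, 535, seed=42)  # deterministic
```

Publishing the seed and the selection function alongside the dataset would make such a subsampled evaluation fully reproducible.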