Automated Benchmark Generation for Repository-Level Coding Tasks
Authors: Konstantinos Vergopoulos, Mark Niklas Mueller, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using SETUPAGENT, we generate two new datasets: (i) SWEE-Bench, an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench, a benchmark focusing on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code-agent performance, we find significant distributional differences, including lower issue-description quality and detail level, higher fix complexity, and, most importantly, up to 60% lower agent success rates. |
| Researcher Affiliation | Collaboration | 1Logic Star AI 2Department of Computer Science, ETH Zurich. Correspondence to: Mark Niklas Müller <EMAIL>. |
| Pseudocode | No | The paper describes methods through textual descriptions and illustrative figures (Figures 1, 2, 3, 4) showing input/output of LLM steps, but does not present formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | We publish SWA-Bench on Hugging Face and the corresponding Docker containers at logicstarai/swa-bench. A suitable evaluation harness is available at github.com/logic-star-ai/SWEBench. |
| Open Datasets | Yes | Using SETUPAGENT, we generate two new datasets: (i) SWEE-Bench, an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench, a benchmark focusing on applications rather than libraries. We publish SWA-Bench on Hugging Face and the corresponding Docker containers at logicstarai/swa-bench. |
| Dataset Splits | No | The paper states, 'We conduct all below experiments on the full SWA and uniformly subsampled versions of SWEE and SWE-Full of identical size (535 instances) due to cost constraints.' This describes an evaluation sampling strategy, but it does not specify training/validation/test splits or the details needed to reproduce the partitioning (e.g., the random seed or the exact selection criteria for the uniform subsampling). |
| Hardware Specification | No | The paper states that 'We run all code execution (both for SETUPAGENT and all Code Agents) in separate Docker containers to improve reproducibility and security' but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | Yes | For SETUPAGENT, we use an Ubuntu 22.04 container as the base image and pre-install a range of common build dependencies but do not provide any Python dependencies. SETUPAGENT enforces the use of the uv environment manager for Python dependencies. For exact versions, see Table 9 in App. A. Table 9 specifies model IDs such as 'gpt-4o-2024-08-06' for GPT-4O and 'claude-3-5-haiku-20241022' for HAIKU-3.5. |
| Experiment Setup | No | The paper states, 'For decoding, we use the default parameters for all Code Agents and greedy decoding for SETUPAGENT,' but does not provide specific hyperparameter values or detailed training configurations for these agents or for SETUPAGENT itself beyond these general statements. |
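The missing subsampling details noted under "Dataset Splits" could be fixed by a seeded uniform draw. Below is a minimal sketch of such a procedure; the seed value, pool size, and instance-ID format are illustrative assumptions, not details from the paper.

```python
import random

def uniform_subsample(instance_ids, k, seed=0):
    """Draw a reproducible uniform sample of k instance IDs.

    Sorting the pool first makes the result independent of input order;
    fixing the seed makes repeated runs identical.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(instance_ids), k))

# Illustrative use: subsample a hypothetical pool of benchmark instances
# down to 535, the evaluation size reported in the paper.
pool = [f"repo__issue-{i}" for i in range(2000)]
subset = uniform_subsample(pool, 535, seed=42)
assert len(subset) == 535
assert subset == uniform_subsample(pool, 535, seed=42)  # deterministic
```

Publishing the seed and the selection function alongside the dataset would make such a subsampled evaluation fully reproducible.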