RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
Authors: Dhruv Gautam, Spandan Garg, Jinu Jang, Neel Sundaresan, Roshanak Zilouchian Moghaddam
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. ... Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer with short time constraints solving 87%. Through trajectory analysis, we identify various unique failure modes of LM agents, and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. |
| Researcher Affiliation | Collaboration | UC Berkeley; Microsoft |
| Pseudocode | Yes | G SIMPLE SINGLE AGENT STATE-AWARE IMPLEMENTATION: As a self-contained example, we have a simple implementation of a state-aware interface contained within a singular agent instance. This state command tracks all its previous edit commands and concatenates them in a separate section. In practice and for results in the paper, we augment the state cache to relay more information about related edits by integrating parts of previous observations as well. state_command: name: state code: \| state() { local working_dir="$PWD"; local open_file_info="${CURRENT_FILE:-n/a}"; local recent_edits_json='[]'; if [ -n "$RECENT_EDITS" ]; then IFS=' \| ' read -r -a edits_array <<< "$RECENT_EDITS"; declare -A seen_edits; filtered_edits=(); for edit in "${edits_array[@]}"; do filename=$(echo "$edit" \| cut -d':' -f1); line_number=$(echo "$edit" \| cut -d':' -f2); if [ -z "${seen_edits["$filename:$line_number"]}" ]; then filtered_edits+=("$edit"); seen_edits["$filename:$line_number"]=1; fi; done; recent_edits_json=$(printf '%s\n' "${filtered_edits[@]}" \| jq -R -s -c 'split("\n")'); fi; state_json=$(jq -n --arg wd "$working_dir" --arg of "$(realpath "$open_file_info")" --argjson re "$recent_edits_json" '{"working_dir": $wd, "open_file": $of, "recent_edits": $re}'); echo "$state_json"; } |
| Open Source Code | Yes | Data available at: https://github.com/microsoft/RefactorBench |
| Open Datasets | Yes | To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. ... Data available at: https://github.com/microsoft/RefactorBench |
| Dataset Splits | No | The paper introduces RefactorBench, a benchmark with 100 tasks, and evaluates baselines by running them "on all RefactorBench tasks". There is no explicit mention of dividing these 100 tasks into training, validation, or test sets for experimental purposes. |
| Hardware Specification | No | The paper mentions running "SWE-agent with gpt-4" and "claude-3.5-sonnet", and discusses "costs and rate limits on model endpoints". This implies the use of API services for large language models, rather than specifying local hardware like CPU or GPU models, memory, or dedicated computational resources. |
| Software Dependencies | No | The paper mentions software components like "SWE-agent", "gpt-4", "gpt-4o", "claude-3.5-sonnet", "Python", "AST" (Abstract Syntax Tree), and "unix diff". However, it does not provide specific version numbers for any of these, which is required for reproducible software dependency information. |
| Experiment Setup | Yes | Using a containerized framework that emulates a user file system with the target repository, we run a baseline of SWE-agent on all RefactorBench tasks with a per instance cost limit of $10.00. ... To contextualize this performance, we have a proficient human developer attempt all the tasks within the benchmark, with a limit of 5 minutes per task using the base instructions, and they solve 87% of the test cases. ... As a self-contained example, we have a simple implementation of a state-aware interface contained within a singular agent instance. This state command tracks all its previous edit commands and concatenates them in a separate section. |
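The quoted `state` command deduplicates edit records by their `filename:line_number` pair before serializing the agent's state to JSON. A minimal Python sketch of that same bookkeeping, assuming the `"filename:line_number"` string format used in the bash version (function and field names here are illustrative, not from the RefactorBench code):

```python
import json

def build_state(working_dir, open_file=None, recent_edits=None):
    """Sketch of the state cache: drop duplicate (filename, line_number)
    edit records, then emit a JSON state blob for the agent to condition on."""
    seen = set()
    filtered = []
    for edit in recent_edits or []:
        # Each record is a "filename:line_number" string, as in the bash version.
        filename, _, line_number = edit.partition(":")
        key = (filename, line_number)
        if key not in seen:
            seen.add(key)
            filtered.append(edit)
    return json.dumps({
        "working_dir": working_dir,
        "open_file": open_file or "n/a",
        "recent_edits": filtered,
    })

print(build_state("/repo", "src/app.py",
                  ["src/app.py:10", "src/app.py:10", "src/util.py:3"]))
```

The returned JSON mirrors the three fields the bash command reports (`working_dir`, `open_file`, `recent_edits`); relaying this blob back to the agent each turn is the "conditioning on representations of state" the abstract credits with the 43.9% improvement.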