RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
Authors: Dhruv Gautam, Spandan Garg, Jinu Jang, Neel Sundaresan, Roshanak Zilouchian Moghaddam
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. ... Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer with short time constraints solving 87%. Through trajectory analysis, we identify various unique failure modes of LM agents, and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. |
| Researcher Affiliation | Collaboration | UC Berkeley; Microsoft |
| Pseudocode | Yes | G SIMPLE SINGLE AGENT STATE-AWARE IMPLEMENTATION: As a self-contained example, we have a simple implementation of a state-aware interface contained within a singular agent instance. This state command tracks all its previous edit commands and concatenates them in a separate section. In practice and for results in the paper, we augment the state cache to relay more information about related edits by integrating parts of previous observations as well. state_command: name: state code: \| state() { local working_dir="$PWD"; local open_file_info="${CURRENT_FILE:-n/a}"; local recent_edits_json='[]'; if [ -n "$RECENT_EDITS" ]; then IFS=' \| ' read -r -a edits_array <<< "$RECENT_EDITS"; declare -A seen_edits; filtered_edits=(); for edit in "${edits_array[@]}"; do filename=$(echo "$edit" \| cut -d':' -f1); line_number=$(echo "$edit" \| cut -d':' -f2); if [ -z "${seen_edits["$filename:$line_number"]}" ]; then filtered_edits+=("$edit"); seen_edits["$filename:$line_number"]=1; fi; done; recent_edits_json=$(printf '%s\n' "${filtered_edits[@]}" \| jq -R -s -c 'split("\n")'); fi; state_json=$(jq -n --arg wd "$working_dir" --arg of "$(realpath "$open_file_info")" --argjson re "$recent_edits_json" '{"working_dir": $wd, "open_file": $of, "recent_edits": $re}'); echo "$state_json"; } |
| Open Source Code | Yes | Data available at: https://github.com/microsoft/RefactorBench |
| Open Datasets | Yes | To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. ... Data available at: https://github.com/microsoft/RefactorBench |
| Dataset Splits | No | The paper introduces RefactorBench, a benchmark with 100 tasks, and evaluates baselines by running them "on all RefactorBench tasks". There is no explicit mention of dividing these 100 tasks into training, validation, or test sets for experimental purposes. |
| Hardware Specification | No | The paper mentions running "SWE-agent with gpt-4" and "claude-3.5-sonnet", and discusses "costs and rate limits on model endpoints". This implies the use of API services for large language models, rather than specifying local hardware like CPU or GPU models, memory, or dedicated computational resources. |
| Software Dependencies | No | The paper mentions software components like "SWE-agent", "gpt-4", "gpt-4o", "claude-3.5-sonnet", "Python", "AST" (Abstract Syntax Tree), and "unix diff". However, it does not provide specific version numbers for any of these, which is required for reproducible software dependency information. |
| Experiment Setup | Yes | Using a containerized framework that emulates a user file system with the target repository, we run a baseline of SWE-agent on all RefactorBench tasks with a per instance cost limit of $10.00. ... To contextualize this performance, we have a proficient human developer attempt all the tasks within the benchmark, with a limit of 5 minutes per task using the base instructions, and they solve 87% of the test cases. ... As a self-contained example, we have a simple implementation of a state-aware interface contained within a singular agent instance. This state command tracks all its previous edit commands and concatenates them in a separate section. |
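The quoted `state` command deduplicates edit records by their `filename:line_number` pair before serializing the agent's state to JSON. A minimal Python sketch of that same bookkeeping, assuming the `"filename:line_number"` string format used in the bash version (function and field names here are illustrative, not from the RefactorBench code):

```python
import json

def build_state(working_dir, open_file=None, recent_edits=None):
    """Sketch of the state cache: drop duplicate (filename, line_number)
    edit records, then emit a JSON state blob for the agent to condition on."""
    seen = set()
    filtered = []
    for edit in recent_edits or []:
        # Each record is a "filename:line_number" string, as in the bash version.
        filename, _, line_number = edit.partition(":")
        key = (filename, line_number)
        if key not in seen:
            seen.add(key)
            filtered.append(edit)
    return json.dumps({
        "working_dir": working_dir,
        "open_file": open_file or "n/a",
        "recent_edits": filtered,
    })

print(build_state("/repo", "src/app.py",
                  ["src/app.py:10", "src/app.py:10", "src/util.py:3"]))
```

The returned JSON mirrors the three fields the bash command reports (`working_dir`, `open_file`, `recent_edits`); relaying this blob back to the agent each turn is the "conditioning on representations of state" the abstract credits with the 43.9% improvement.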