Training Software Engineering Agents and Verifiers with SWE-Gym
Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use SWE-Gym to train language model based SWE agents, and achieve up to 19% absolute gains in resolution rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. |
| Researcher Affiliation | Collaboration | Jiayi Pan 1 Xingyao Wang 2 Graham Neubig 3 Navdeep Jaitly 4 Heng Ji 2 Alane Suhr 1 Yizhe Zhang 4 1UC Berkeley 2UIUC 3CMU 4Apple. Correspondence to: Jiayi Pan <EMAIL>, Xingyao Wang <EMAIL>, Alane Suhr <EMAIL>, Yizhe Zhang <EMAIL>. |
| Pseudocode | Yes | B.5. Moatless Tools ORM Prompt: The following is pseudo-code that generates a prompt for the Moatless Tools Verifier (ORM), modified from (Zhang et al., 2024a). B.6. OpenHands ORM Prompt: The following is pseudo-code that generates a prompt for the OpenHands Verifier (ORM). |
| Open Source Code | Yes | To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories. Lastly, we open source all artifacts of the project, including dataset, model weights, agent trajectory data and the training pipeline. |
| Open Datasets | Yes | To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories. Table 1: SWE-Gym is the first publicly available training environment combining real-world SWE tasks from GitHub issues with pre-installed dependencies and executable test verification. |
| Dataset Splits | Yes | We use the standard SWE agent benchmarks SWE-Bench Lite and Verified (Jimenez et al., 2024) for evaluation. Table 1 (excerpt, # Instances total / train): SWE-Bench (train) (Jimenez et al., 2024) 19,008 / 19,008; SWE-Bench (test) (Jimenez et al., 2024) 2,294 / 0; SWE-Gym 2,438 / 2,438. SWE-Gym Lite comprises 230 instances. |
| Hardware Specification | Yes | We fine-tuned the 7B, 14B, and 32B variants of the model, and experiments were performed with 2-8x NVIDIA H100 80GB GPUs on Modal (Modal, 2024). For the 32B model, we use Unsloth (Unsloth Team, 2024) with a single H100 GPU for LoRA fine-tuning. |
| Software Dependencies | No | We use torchtune (PyTorch Team, 2024) for full parameter fine-tuning with a learning rate of 1e-4, maximum 5 epochs, global batch size of 8, max context length of 32768. We fine-tuned the 7B, 14B, and 32B variants of the model, and experiments were performed with 2-8x NVIDIA H100 80GB GPUs on Modal (Modal, 2024). For the 32B model, we use Unsloth (Unsloth Team, 2024) with a single H100 GPU for LoRA fine-tuning. |
| Experiment Setup | Yes | We use torchtune (PyTorch Team, 2024) for full parameter fine-tuning with a learning rate of 1e-4, maximum 5 epochs, global batch size of 8, max context length of 32768. For experiments with the 7B model, we use torchtune to train the policy model with full fine-tuning using 4 H100 GPUs. We set batch size to 8, learning rate to 2e-5, and train for 5 epochs. For the 32B model, we use Unsloth (Unsloth Team, 2024) with a single H100 GPU for LoRA fine-tuning. We set the number of epochs to 5, batch size to 8, LoRA rank to 64, and learning rate to 5e-4. |
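The hyperparameters quoted in the Experiment Setup row can be collected into a single sketch for easy comparison. This is an illustrative consolidation only: the field names below are hypothetical and do not correspond to actual torchtune or Unsloth config keys, which each have their own YAML/API formats.

```python
# Illustrative summary of the fine-tuning settings reported in the paper.
# Field names are hypothetical; they are NOT real torchtune/Unsloth config keys.

FULL_FINETUNE_7B = {
    "framework": "torchtune",      # full-parameter fine-tuning
    "gpus": 4,                     # 4x NVIDIA H100 80GB
    "learning_rate": 2e-5,
    "global_batch_size": 8,
    "epochs": 5,
    "max_context_length": 32768,
}

LORA_FINETUNE_32B = {
    "framework": "unsloth",        # LoRA fine-tuning
    "gpus": 1,                     # single H100
    "learning_rate": 5e-4,
    "global_batch_size": 8,
    "epochs": 5,
    "lora_rank": 64,
}

if __name__ == "__main__":
    for name, cfg in [("7B full FT", FULL_FINETUNE_7B),
                      ("32B LoRA", LORA_FINETUNE_32B)]:
        print(name, cfg)
```

Note the trade-off the two settings imply: full fine-tuning of the 7B model needs 4 GPUs and a small learning rate, while LoRA on the 32B model fits on one H100 at a larger learning rate with rank-64 adapters.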