Training Software Engineering Agents and Verifiers with SWE-Gym

Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang

ICML 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We use SWE-Gym to train language model-based SWE agents, and achieve up to 19% absolute gains in resolution rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state of the art for open-weight SWE agents.
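The inference-time scaling described above is a best-of-n scheme: sample several agent trajectories per issue, score each with a trained verifier (an outcome reward model, ORM), and keep the highest-scoring one. A minimal sketch of that selection step, where the `Trajectory` type and `verifier_score` field are illustrative stand-ins rather than the paper's actual API:

```python
# Hedged sketch of verifier-guided best-of-n selection: given several
# candidate trajectories for one issue, pick the one the verifier rates
# highest. `Trajectory` and `verifier_score` are hypothetical names.
from dataclasses import dataclass


@dataclass
class Trajectory:
    patch: str            # the candidate code patch produced by the agent
    verifier_score: float  # ORM's estimate that the patch resolves the issue


def best_of_n(trajectories: list[Trajectory]) -> Trajectory:
    """Select the trajectory the verifier rates highest."""
    return max(trajectories, key=lambda t: t.verifier_score)
```

Resolution rate then improves with the number of sampled trajectories n, at the cost of n agent rollouts per issue.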
Researcher Affiliation | Collaboration | Jiayi Pan (1), Xingyao Wang (2), Graham Neubig (3), Navdeep Jaitly (4), Heng Ji (2), Alane Suhr (1), Yizhe Zhang (4); (1) UC Berkeley, (2) UIUC, (3) CMU, (4) Apple. Correspondence to: Jiayi Pan <EMAIL>, Xingyao Wang <EMAIL>, Alane Suhr <EMAIL>, Yizhe Zhang <EMAIL>.
Pseudocode | Yes | B.5. Moatless Tools ORM Prompt: pseudocode that generates a prompt for the Moatless Tools verifier (ORM), modified from (Zhang et al., 2024a). B.6. OpenHands ORM Prompt: pseudocode that generates a prompt for the OpenHands verifier (ORM).
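The ORM prompt generators referenced above assemble the issue and the agent's trajectory into a judgment prompt for the verifier. A minimal sketch of that shape; the exact templates are in the paper's appendix B.5/B.6, and this wording and function name are illustrative only:

```python
# Hedged sketch of an ORM-style prompt builder. The verifier is shown the
# issue and the agent's trajectory and asked for a resolved/unresolved
# judgment. `build_orm_prompt` is a hypothetical name, not the paper's API.
def build_orm_prompt(issue: str, trajectory: str) -> str:
    """Format issue + trajectory into a binary-judgment verifier prompt."""
    return (
        "You are a verifier for software engineering agents.\n"
        f"Issue:\n{issue}\n\n"
        f"Agent trajectory:\n{trajectory}\n\n"
        "Question: did the trajectory resolve the issue? Answer YES or NO."
    )
```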
Open Source Code | Yes | To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories. We open-source all artifacts of the project, including the dataset, model weights, agent trajectory data, and the training pipeline.
Open Datasets | Yes | To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories. Table 1: SWE-Gym is the first publicly available training environment combining real-world SWE tasks from GitHub issues with pre-installed dependencies and executable test verification.
Dataset Splits | Yes | We use the standard SWE agent benchmarks SWE-Bench Lite and Verified (Jimenez et al., 2024) for evaluation. Table 1 (excerpt):
  Dataset | # Instances (total) | # Instances (train)
  SWE-Bench (train) (Jimenez et al., 2024) | 19,008 | 19,008
  SWE-Bench (test) (Jimenez et al., 2024) | 2,294 | 0
  SWE-Gym | 2,438 | 2,438
SWE-Gym Lite comprises 230 instances.
Hardware Specification | Yes | We fine-tuned the 7B, 14B, and 32B variants of the model; experiments were performed with 2-8x NVIDIA H100 80G GPUs on Modal (Modal, 2024). For the 32B model, we use Unsloth (Unsloth Team, 2024) with a single H100 GPU for LoRA fine-tuning.
Software Dependencies | No | We use torchtune (PyTorch Team, 2024) for full-parameter fine-tuning with a learning rate of 1e-4, a maximum of 5 epochs, a global batch size of 8, and a max context length of 32768. We fine-tuned the 7B, 14B, and 32B variants of the model; experiments were performed with 2-8x NVIDIA H100 80G GPUs on Modal (Modal, 2024). For the 32B model, we use Unsloth (Unsloth Team, 2024) with a single H100 GPU for LoRA fine-tuning.
Experiment Setup | Yes | We use torchtune (PyTorch Team, 2024) for full-parameter fine-tuning with a learning rate of 1e-4, a maximum of 5 epochs, a global batch size of 8, and a max context length of 32768. For experiments with the 7B model, we use torchtune to train the policy model with full fine-tuning on 4 H100 GPUs; we set the batch size to 8, the learning rate to 2e-5, and train for 5 epochs. For the 32B model, we use Unsloth (Unsloth Team, 2024) with a single H100 GPU for LoRA fine-tuning; we set the number of epochs to 5, the batch size to 8, the LoRA rank to 64, and the learning rate to 5e-4.
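The two fine-tuning recipes reported above can be collected into a plain-Python sketch for bookkeeping. The field names are illustrative, not torchtune or Unsloth configuration keys, and `optimizer_steps` is a hypothetical helper:

```python
# Hedged sketch: the paper's two fine-tuning recipes as plain dicts.
# Values come from the quoted setup; keys are illustrative only.
FULL_FINETUNE_7B = {
    "framework": "torchtune",   # full-parameter fine-tuning, 4x H100
    "learning_rate": 2e-5,
    "epochs": 5,
    "global_batch_size": 8,
    "max_context_length": 32768,
}

LORA_32B = {
    "framework": "unsloth",     # LoRA fine-tuning, single H100
    "learning_rate": 5e-4,
    "epochs": 5,
    "global_batch_size": 8,
    "lora_rank": 64,
}


def optimizer_steps(num_examples: int, cfg: dict) -> int:
    """Total optimizer steps for `num_examples` training trajectories."""
    steps_per_epoch = -(-num_examples // cfg["global_batch_size"])  # ceil div
    return steps_per_epoch * cfg["epochs"]
```

For example, one pass of 5 epochs over all 2,438 SWE-Gym instances at batch size 8 would give ceil(2438 / 8) * 5 = 1,525 optimizer steps.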