Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents

Authors: Karina Zainullina, Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Daria Litvintseva, Simon Karasik, Filipp Fisin, Sergei Skvortsov, Maksim Nekrashevich, Anton Shevtsov, Boris Yangel

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find these methods to double the average success rate of a fine-tuned Qwen-72B model, achieving 40.8%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
Researcher Affiliation Industry Karina Zainullina *1, Alexander Golubev *1, Maria Trofimova *1, Sergei Polezhaev 1, Ibragim Badertdinov 1, Daria Litvintseva 1, Simon Karasik 1, Filipp Fisin 1, Sergei Skvortsov 1, Maksim Nekrashevich 1, Anton Shevtsov 1, Boris Yangel 1. *Equal contribution. 1Nebius. Correspondence to: Boris Yangel <EMAIL>.
Pseudocode Yes Algorithm 1: Sample-based 1-step lookahead
Input: base policy π, number of action candidates K, critic model Qπ, environment E
s ← E.init()
repeat
    listOfActions ← []
    listOfQValues ← []
    for k = 1 to K do
        a_k ← sample from π(a | s)
        listOfQValues.append(Qπ(s, a_k))
        listOfActions.append(a_k)
    end for
    a ← listOfActions[argmax(listOfQValues)]
    s ← E.step(s, a)
until s is terminal
return s
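The algorithm above can be sketched in Python. This is an illustrative translation, not code from the paper: `policy_sample`, `critic_q`, `env_init`, `env_step`, and `is_terminal` are hypothetical callables standing in for the paper's policy, critic model, and environment interface.

```python
import random


def one_step_lookahead(policy_sample, critic_q, env_init, env_step,
                       is_terminal, k=4):
    """Sample-based 1-step lookahead: at each state, sample K candidate
    actions from the policy, score each with the critic Q(s, a), and
    execute the highest-scoring candidate until a terminal state."""
    s = env_init()
    while not is_terminal(s):
        candidates = [policy_sample(s) for _ in range(k)]
        q_values = [critic_q(s, a) for a in candidates]
        best = candidates[q_values.index(max(q_values))]
        s = env_step(s, best)
    return s
```

For a quick sanity check, a toy environment where the state is a running sum, actions are small integers, and the critic simply prefers larger actions will terminate once the sum reaches a threshold; the guided policy greedily picks the largest of the K sampled actions at every step.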
Open Source Code No The paper does not contain an explicit statement about releasing its source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets Yes We utilize the SWE-agent scaffolding (Yang et al., 2024) to evaluate guided search techniques described in Section 3 on the SWE-bench Verified dataset (Jimenez et al., 2024). Our training issue set consists of 6,500 issue–pull request pairs from 2,500 Python repositories, carefully filtered to avoid leakage of the SWE-bench test set, and supplemented with the SWE-bench development set.
Dataset Splits No To balance comprehensive evaluation against computational constraints, we created Verified-50, a dataset of 50 randomly selected problems from SWE-bench Verified. This dataset allows us to compute an unbiased estimate of the success rate on the full SWE-bench Verified, together with the estimation error, at the cost of a single run on the full evaluation set. Our training issue set consists of 6,500 issue–pull request pairs from 2,500 Python repositories, carefully filtered to avoid leakage of the SWE-bench test set, and supplemented with the SWE-bench development set. Update the current policy by fine-tuning on a curated subset of successful trajectories collected so far. The paper mentions a randomly selected evaluation subset (Verified-50) and fine-tuning on a 'curated subset' of successful trajectories, but it does not provide percentages or absolute counts for training, validation, or test splits of its own collected data, nor does it detail the split methodology beyond 'randomly selected' or 'curated'.
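Because Verified-50 is drawn uniformly at random from SWE-bench Verified, its observed success rate is an unbiased estimator of the full-benchmark rate, and a binomial standard error quantifies the estimation error the row above mentions. A minimal sketch of that computation (the helper name and interface are illustrative, not from the paper; the finite-population correction is omitted for simplicity):

```python
import math


def success_rate_estimate(successes, n=50):
    """Estimate the full-benchmark success rate from a random n-problem
    subset, returning the point estimate and its binomial standard error
    sqrt(p * (1 - p) / n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se
```

For example, 20 solved problems out of 50 gives an estimated success rate of 40% with a standard error of roughly 7 percentage points.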
Hardware Specification No The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments or training the models.
Software Dependencies No We use Qwen2.5-72B (Bai et al., 2023) both as the initial collection policy and as the starting point for every finetuning iteration. We use LLaMA3.1-70B (Dubey et al., 2024) to initialize critic training. The paper mentions specific LLM models used (Qwen2.5-72B, LLaMA3.1-70B) but does not list general software dependencies or libraries with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed for replication.
Experiment Setup Yes Table 3. Hyperparameters used in experiments
Hyperparameter          Critic   Base policy
Optimizer               AdamW    AdamW
Warmup steps            7        7
Training steps          459      215
Learning rate value     2e-6     4e-6
Learning rate schedule  cosine   cosine
Batch size              128      128
Number of epochs        4        6
Weight decay            0.1      0.1
Sequence length         32768    32768
Given our sparse reward setting, we use TD(λ) (Sutton & Barto, 1998) to compute training targets. ... Figure 4 illustrates that λ = 0.7 yields the highest success rate... For 1-step lookahead, we utilize the optimal search parameters identified in Subsection 4.4: policy sampling temperature T = 0.9 and number of candidates K = 4.
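The row above notes that critic training targets are computed with TD(λ) in a sparse-reward setting, with λ = 0.7 performing best. A minimal sketch of the standard backward TD(λ) recursion G_t = r_t + γ((1 − λ)·V(s_{t+1}) + λ·G_{t+1}), assuming γ = 1 and per-trajectory reward and bootstrap-value lists; the function name and interface are illustrative, not taken from the paper:

```python
def td_lambda_targets(rewards, values, lam=0.7, gamma=1.0):
    """Compute TD(lambda) value targets for one trajectory.

    rewards[t] is the reward received after step t; values[t] is the
    critic's bootstrap estimate V(s_{t+1}), with 0 for the terminal
    state. Targets are filled by a single backward pass.
    """
    targets = [0.0] * len(rewards)
    g = 0.0  # lambda-return of the step after the current one
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * values[t] + lam * g)
        targets[t] = g
    return targets
```

In a sparse-reward trajectory with a single terminal reward of 1 and zero bootstrap values, the targets decay geometrically by λ toward earlier steps, which is what makes λ a useful bias-variance knob here.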