OrcaLoca: An LLM Agent Framework for Software Issue Localization

Authors: Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, Jishen Zhao

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
Researcher Affiliation Collaboration 1. University of California, San Diego, USA; 2. Intel Corporation. Correspondence to: Jishen Zhao <EMAIL>.
Pseudocode Yes To aid understanding of Figure 2, we provide a core algorithm pseudocode in Algorithm 1. It summarizes the essential components discussed in Sections 3.2, 3.3, and 3.4.
Open Source Code Yes OrcaLoca is available at https://github.com/fishmingyu/OrcaLoca.
Open Datasets Yes SWE-bench (Jimenez et al., 2023) is a widely used dataset for evaluating the ability of LLM systems to address real-world software engineering challenges. It comprises 2,294 task instances derived from 12 popular Python repositories, where each task requires a patch to resolve the issue described in its corresponding GitHub issue.
Dataset Splits No The paper describes subsets of the SWE-bench dataset, such as SWE-bench Lite (300 instances), SWE-bench Verified (500 instances), and SWE-bench Common (93 instances), used for evaluation. However, it does not specify explicit training/test/validation splits for model development, nor how these instances are partitioned beyond their use as evaluation benchmarks.
Hardware Specification No This research was partially conducted using computational resources provided by the Google Cloud Platform (GCP) Credits Award. However, specific hardware details like GPU/CPU models or memory amounts are not provided.
Software Dependencies Yes OrcaLoca is built on the LlamaIndex framework (Liu, 2022), which supports various foundation models. For our experiments, we used Claude-3.5-Sonnet-20241022 (Anthropic, 2024) as the underlying model, with a sampling temperature set to 0.1 to prioritize deterministic results. [...] We then generate and execute a reproduction snippet using an LLM and record its execution trace with VizTracer (Gao, 2025).
Experiment Setup Yes For our experiments, we used Claude-3.5-Sonnet-20241022 (Anthropic, 2024) as the underlying model, with a sampling temperature set to 0.1 to prioritize deterministic results. For the top-k values used in action decomposition (Section 3.3), we set k = 3 for class decomposition and k = 2 for file decomposition. In the context pruning (Section 3.4), the context window size is configured to retain 12 entries (top-k). [...] For the repair process, we generated 40 patches (1 at a temperature of 0 and the rest at 0.8) with the str_replace_format argument set. [...] Regression tests were filtered with a temperature of 0, while reproduction tests were generated using 40 samples (1 at a temperature of 0 and the rest at 0.8).
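The setup above can be collected into a small configuration sketch. This is a minimal, hedged illustration: the dict keys and the helper function name are hypothetical, and only the numeric values (temperature 0.1, top-k of 3/2, 12-entry context window, 40 repair samples with one at temperature 0 and the rest at 0.8) come from the paper's description.

```python
# Hypothetical configuration mirroring the reported OrcaLoca experiment setup.
# Key names are illustrative, not taken from the actual codebase.
experiment_config = {
    "model": "claude-3-5-sonnet-20241022",        # underlying LLM
    "temperature": 0.1,                           # near-deterministic localization runs
    "decompose_top_k": {"class": 3, "file": 2},   # action decomposition (Section 3.3)
    "context_window_entries": 12,                 # context pruning keeps top-12 entries (Section 3.4)
}

def repair_sampling_temperatures(num_patches=40, high_temp=0.8):
    """Temperature schedule for the repair process as described:
    one greedy patch at temperature 0, the remainder sampled at 0.8."""
    return [0.0] + [high_temp] * (num_patches - 1)

temps = repair_sampling_temperatures()
print(len(temps), temps[0], temps[-1])  # 40 0.0 0.8
```

The same schedule (one greedy sample plus high-temperature samples) is reportedly reused for reproduction-test generation, so a single helper covers both cases.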