OrcaLoca: An LLM Agent Framework for Software Issue Localization

Authors: Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, Jishen Zhao

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
Researcher Affiliation Collaboration 1. University of California, San Diego, USA; 2. Intel Corporation. Correspondence to: Jishen Zhao <EMAIL>.
Pseudocode Yes To aid understanding of Figure 2, we provide a core algorithm pseudocode in Algorithm 1. It summarizes the essential components discussed in Sections 3.2, 3.3, and 3.4.
Open Source Code Yes OrcaLoca is available at https://github.com/fishmingyu/OrcaLoca.
Open Datasets Yes SWE-bench (Jimenez et al., 2023) is a widely used dataset for evaluating the ability of LLM systems to address real-world software engineering challenges. It comprises 2,294 task instances derived from 12 popular Python repositories, where each task requires a patch to resolve the issue described in its corresponding GitHub issue.
Dataset Splits No The paper describes subsets of the SWE-bench dataset, such as SWE-bench Lite (300 instances), SWE-bench Verified (500 instances), and SWE-bench Common (93 instances), used for evaluation. However, it does not specify explicit training/test/validation splits for model development, nor how these instances are partitioned beyond their use as evaluation benchmarks.
Hardware Specification No This research was partially conducted using computational resources provided by the Google Cloud Platform (GCP) Credits Award. However, specific hardware details like GPU/CPU models or memory amounts are not provided.
Software Dependencies Yes OrcaLoca is built on the LlamaIndex framework (Liu, 2022), which supports various foundation models. For our experiments, we used Claude-3.5-Sonnet-20241022 (Anthropic, 2024) as the underlying model, with a sampling temperature set to 0.1 to prioritize deterministic results. [...] We then generate and execute a reproduction snippet using an LLM and record its execution trace with VizTracer (Gao, 2025).
Experiment Setup Yes For our experiments, we used Claude-3.5-Sonnet-20241022 (Anthropic, 2024) as the underlying model, with a sampling temperature set to 0.1 to prioritize deterministic results. For the top-k values used in action decomposition (Section 3.3), we set k = 3 for class decomposition and k = 2 for file decomposition. In the context pruning (Section 3.4), the context window size is configured to retain 12 entries (top-k). [...] For the repair process, we generated 40 patches (1 at a temperature of 0 and the rest at 0.8) with the str_replace_format argument set. [...] Regression tests were filtered with a temperature of 0, while reproduction tests were generated using 40 samples (1 at a temperature of 0 and the rest at 0.8).
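The setup above can be collected into a small configuration sketch. This is a minimal, hedged illustration: the dict keys and the helper function name are hypothetical, and only the numeric values (temperature 0.1, top-k of 3/2, 12-entry context window, 40 repair samples with one at temperature 0 and the rest at 0.8) come from the paper's description.

```python
# Hypothetical configuration mirroring the reported OrcaLoca experiment setup.
# Key names are illustrative, not taken from the actual codebase.
experiment_config = {
    "model": "claude-3-5-sonnet-20241022",        # underlying LLM
    "temperature": 0.1,                           # near-deterministic localization runs
    "decompose_top_k": {"class": 3, "file": 2},   # action decomposition (Section 3.3)
    "context_window_entries": 12,                 # context pruning keeps top-12 entries (Section 3.4)
}

def repair_sampling_temperatures(num_patches=40, high_temp=0.8):
    """Temperature schedule for the repair process as described:
    one greedy patch at temperature 0, the remainder sampled at 0.8."""
    return [0.0] + [high_temp] * (num_patches - 1)

temps = repair_sampling_temperatures()
print(len(temps), temps[0], temps[-1])  # 40 0.0 0.8
```

The same schedule (one greedy sample plus high-temperature samples) is reportedly reused for reproduction-test generation, so a single helper covers both cases.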