RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph

Authors: Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, Dong Yu

ICLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We evaluate REPOGRAPH on SWE-bench by plugging it into four different methods of two lines of approaches, where REPOGRAPH substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks. Our analyses also demonstrate the extensibility and flexibility of REPOGRAPH by testing on another repo-level coding benchmark, CrossCodeEval." From Section 4 (Experiments): "We evaluated REPOGRAPH as a plug-in component, i.e., integrated into existing baseline models of the two aforementioned research lines to assess its performance. We use the same baseline settings and configurations when incorporating REPOGRAPH to ensure a fair comparison."
Researcher Affiliation: Collaboration. Siru Ouyang (1), Wenhao Yu (2), Kaixin Ma (2), Zilin Xiao (3), Zhihan Zhang (4), Mengzhao Jia (4), Jiawei Han (1), Hongming Zhang (2), Dong Yu (2). Affiliations: (1) University of Illinois Urbana-Champaign, (2) Tencent AI Seattle Lab, (3) Rice University, (4) University of Notre Dame.
Pseudocode: No. The paper describes the construction process and utility of REPOGRAPH in detail in Sections 3.1 and 3.2, and illustrates it visually in Figure 2, but it does not present this information as structured pseudocode or an algorithm block.
Open Source Code: Yes. "Our code is available at https://github.com/ozyyshr/RepoGraph"
Open Datasets: Yes. "Dataset. We test REPOGRAPH in SWE-bench-Lite. Each problem in the dataset requires submitting a patch to solve the underlying issue described in the input issue description. ... We also test REPOGRAPH on CrossCodeEval to verify its transferability to general coding tasks that require repository-level code understanding."
Dataset Splits: No. The paper mentions using the "SWE-bench-Lite test set" and CrossCodeEval, but it does not specify train/validation/test split ratios, absolute sample counts per split, or a detailed splitting methodology in the main text.
Hardware Specification: No. "All evaluation processes are performed in a containerized Docker environment, ensuring stability and reproducibility, made possible through contributions from the open-source community." The paper mentions using a Docker environment but does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for the experiments.
Software Dependencies: Yes. "We used GPT-4o (2024-05-13) and GPT-4-Turbo (gpt-4-1106-preview) from OpenAI for evaluation and analyses in our experiments. All evaluation processes are performed in a containerized Docker environment." Additionally, Section 3.1 states: "For each code file, we utilize tree-sitter to parse the code, leveraging its Abstract Syntax Tree (AST) framework," with a footnote linking to https://pypi.org/project/tree-sitter-languages/. Appendix C.1 adds: "We conduct additional experiments on SWE-Bench-Lite with Claude-3.5-Sonnet, and the results are shown in Table 6."
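To make the tree-sitter dependency concrete: the paper parses each code file into an AST to index repository-level entities. The sketch below is illustrative only and substitutes Python's stdlib `ast` module for tree-sitter (which the paper uses for multi-language support); the function name `extract_definitions` and the sample snippet are hypothetical, not from the paper.

```python
import ast


def extract_definitions(source: str) -> list[tuple[str, str, int]]:
    """Collect (kind, name, line) for every function/class definition.

    Illustrative stand-in for the paper's tree-sitter pass: walk the
    syntax tree and record code entities that a repo-level graph
    could use as nodes.
    """
    entities = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entities.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            entities.append(("class", node.name, node.lineno))
    return entities


# Hypothetical sample file contents for demonstration.
code = """
class Graph:
    def add_edge(self, u, v):
        pass

def build_graph(files):
    return Graph()
"""

print(extract_definitions(code))
```

With tree-sitter-languages the same walk would start from `get_parser("python").parse(source.encode())` and traverse named AST nodes, which generalizes to other languages.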
Experiment Setup: No. "We use the same baseline settings and configurations when incorporating REPOGRAPH to ensure a fair comparison. Detailed implementations and prompts used can be found in Appendix A." and "Detailed implementations of these variants and prompts used can be found in Appendix B.1." The main text defers to the appendices for implementations and prompts but does not explicitly provide specific hyperparameter values, model initialization details, dropout rates, or training schedules.