RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph

Authors: Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, Dong Yu

ICLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We evaluate REPOGRAPH on SWE-bench by plugging it into four different methods of two lines of approaches, where REPOGRAPH substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks. Our analyses also demonstrate the extensibility and flexibility of REPOGRAPH by testing on another repo-level coding benchmark, CrossCodeEval." From Section 4 (Experiments): "We evaluated REPOGRAPH as a plug-in component, i.e., integrated into existing baseline models of the two aforementioned research lines to assess its performance. We use the same baseline settings and configurations when incorporating REPOGRAPH to ensure a fair comparison."
Researcher Affiliation: Collaboration. Siru Ouyang (1), Wenhao Yu (2), Kaixin Ma (2), Zilin Xiao (3), Zhihan Zhang (4), Mengzhao Jia (4), Jiawei Han (1), Hongming Zhang (2), Dong Yu (2). Affiliations: (1) University of Illinois Urbana-Champaign, (2) Tencent AI Seattle Lab, (3) Rice University, (4) University of Notre Dame.
Pseudocode: No. The paper describes the construction process and utility of REPOGRAPH in detail in Sections 3.1 and 3.2, and illustrates it visually in Figure 2, but it does not present this information as structured pseudocode or an algorithm block.
Open Source Code: Yes. "Our code is available at https://github.com/ozyyshr/RepoGraph"
Open Datasets: Yes. "Dataset. We test REPOGRAPH in SWE-bench-Lite. Each problem in the dataset requires submitting a patch to solve the underlying issue described in the input issue description. ... We also test REPOGRAPH on CrossCodeEval to verify its transferability to general coding tasks that require repository-level code understanding."
Dataset Splits: No. The paper mentions using the "SWE-bench-Lite test set" and CrossCodeEval, but it does not specify train/validation/test split ratios, absolute sample counts per split, or a detailed splitting methodology in the main text.
Hardware Specification: No. "All evaluation processes are performed in a containerized Docker environment, ensuring stability and reproducibility, made possible through contributions from the open-source community." The paper mentions using a Docker environment but does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for the experiments.
Software Dependencies: Yes. "We used GPT-4o (2024-05-13) and GPT-4-Turbo (gpt-4-1106-preview) from OpenAI for evaluation and analyses in our experiments. All evaluation processes are performed in a containerized Docker environment." Additionally, Section 3.1 states: "For each code file, we utilize tree-sitter to parse the code, leveraging its Abstract Syntax Tree (AST) framework," with a footnote linking to https://pypi.org/project/tree-sitter-languages/. Appendix C.1 adds: "We conduct additional experiments on SWE-Bench-Lite with Claude-3.5-Sonnet, and the results are shown in Table 6."
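To make the tree-sitter dependency concrete: the paper parses each code file into an AST to index repository-level entities. The sketch below is illustrative only and substitutes Python's stdlib `ast` module for tree-sitter (which the paper uses for multi-language support); the function name `extract_definitions` and the sample snippet are hypothetical, not from the paper.

```python
import ast


def extract_definitions(source: str) -> list[tuple[str, str, int]]:
    """Collect (kind, name, line) for every function/class definition.

    Illustrative stand-in for the paper's tree-sitter pass: walk the
    syntax tree and record code entities that a repo-level graph
    could use as nodes.
    """
    entities = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entities.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            entities.append(("class", node.name, node.lineno))
    return entities


# Hypothetical sample file contents for demonstration.
code = """
class Graph:
    def add_edge(self, u, v):
        pass

def build_graph(files):
    return Graph()
"""

print(extract_definitions(code))
```

With tree-sitter-languages the same walk would start from `get_parser("python").parse(source.encode())` and traverse named AST nodes, which generalizes to other languages.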
Experiment Setup: No. "We use the same baseline settings and configurations when incorporating REPOGRAPH to ensure a fair comparison. Detailed implementations and prompts used can be found in Appendix A." and "Detailed implementations of these variants and prompts used can be found in Appendix B.1." The main text defers to the appendices for implementations and prompts but does not explicitly provide specific hyperparameter values, model initialization details, dropout rates, or training schedules.