reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Practical Defect-Focused Automated Code Review

Authors: Junyi Lu, Lili Jiang, Xiaojia Li, Jianbing Fang, Fengjun Zhang, Li Yang, Chun Zuo

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our approach, validated on real-world merge requests from historical fault reports, achieves a 2× improvement over standard LLMs and a 10× gain over previous baselines. An ablation study further confirms the contribution of each component...
Researcher Affiliation	Collaboration	1Laboratory of Precise Computing, Institute of Software, Chinese Academy of Sciences, Beijing, China 2University of Chinese Academy of Sciences, Beijing, China 3Kuaishou Technology, Beijing, China 4Independent Researcher 5Sinosoft Company Limited, Beijing, China. Correspondence to: Li Yang <EMAIL>.
Pseudocode	Yes	The pseudo code of our slicing algorithms is presented in Section G. ... Algorithm 1 Code Slicing ... Algorithm 2 Process AST ... Algorithm 3 Generate New Slice ... Algorithm 4 Get Contiguous Diff Segment ... Algorithm 5 Apply Slicing Algorithm ... Algorithm 6 Original Diff ... Algorithm 7 Parent Function ... Algorithm 8 Left Flow ... Algorithm 9 Full Flow
Open Source Code	Yes	Data Availability. We publicly release our codes at https://zenodo.org/records/14779175. Details regarding their open-source status can be found in Section U.
Open Datasets	Yes	To systematically assess the performance of our system, we developed a dataset curated from the company's fault report platform. Each case in this dataset corresponds to an issue that resulted in actual company losses. ... We have released a desensitized JSON folder of fault descriptions in our Zenodo repository.
Dataset Splits	No	To systematically assess the performance of our system, we developed a dataset curated from the company's fault report platform. Each case in this dataset corresponds to an issue that resulted in actual company losses. ... The dataset consists of 45 real-world fault reports, each corresponding to a significant issue that caused financial losses, along with the associated merge request snapshots.
Hardware Specification	Yes	All models and baselines are hosted on a server equipped with an AMD EPYC 7702 CPU and eight Nvidia A100-40G GPUs.
Software Dependencies	No	The code slicing component of our framework is implemented using Cppcheck (Marjamäki, 2024), while the LLM engines are integrated through an API supported by the vLLM framework (Kwon et al., 2023), and baselines are integrated via Flask (Organization, 2024). For large models such as LLaMA3.1-405B, we utilize an Int4 version quantized using AWQ (Lin et al., 2024).
Experiment Setup	Yes	Our filtering mechanism, integrated within the multi-role system (Section 3.3), operates by answering three key questions for each comment: Q1: Is this comment a nitpick? ... Each question is rated on a scale from 1 to 7, with 1 indicating a nitpick, fake problem, or minimal issue, and 7 indicating a severe and real issue. ... Comments with Q1 or Q2 scores of 4 or below are discarded.
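
The filtering rule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ScoredComment` class, the field names `q1` to `q3`, and the function names are hypothetical. Only the 1 to 7 rating scale and the rule that a Q1 or Q2 score of 4 or below discards a comment come from the quoted text. The excerpt elides the wording of Q2 and Q3, so the sketch treats them as opaque scores.

```python
from dataclasses import dataclass
from typing import List

# From the quoted text: comments with Q1 or Q2 scores of 4 or below are discarded.
DISCARD_THRESHOLD = 4
MIN_SCORE, MAX_SCORE = 1, 7  # rating scale stated in the quoted text


@dataclass
class ScoredComment:
    """Hypothetical container for one review comment and its three ratings."""
    text: str
    q1: int  # Q1 rating (nitpick question)
    q2: int  # Q2 rating (question text elided in the excerpt)
    q3: int  # Q3 rating (question text elided in the excerpt)


def keep_comment(comment: ScoredComment) -> bool:
    """Return True if the comment survives the Q1/Q2 threshold filter."""
    for score in (comment.q1, comment.q2, comment.q3):
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"score {score} is outside the 1-7 scale")
    # Q3 is validated but does not take part in the discard rule quoted above.
    return comment.q1 > DISCARD_THRESHOLD and comment.q2 > DISCARD_THRESHOLD


def filter_comments(comments: List[ScoredComment]) -> List[ScoredComment]:
    """Keep only the comments that pass the filter, preserving order."""
    return [c for c in comments if keep_comment(c)]
```

In this sketch a comment rated (Q1=5, Q2=5, Q3=1) is kept, because the quoted rule thresholds only Q1 and Q2; how the Q3 score is used downstream is not stated in the excerpt.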