Let the Code LLM Edit Itself When You Edit the Code

Authors: Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Zhi Zhang, Di He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation covers three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full-recomputation approach across all model sizes and tasks while closely matching model performance.
Researcher Affiliation | Collaboration | National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; ByteDance Inc.
Pseudocode | No | The paper describes Positional Integrity Encoding (PIE) using mathematical equations (1)-(9) and descriptive text outlining the process, but it does not include a dedicated, structured pseudocode block or an explicitly labeled algorithm section.
Open Source Code | Yes | Code is available at https://github.com/zhenyuhe00/PIE.
Open Datasets | Yes | We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset (Liu et al., 2024a), utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. To further validate the effectiveness of PIE, we conduct experiments on code generation tasks, using HumanEval (Chen et al., 2021) and its C++ version from HumanEval-X (Zheng et al., 2023).
Dataset Splits | No | The paper reports experiments on the RepoBench-C-8k test set (Table 1) and describes how edited contexts for the insertion, deletion, and editing tasks are constructed by randomly modifying lines. However, it does not explicitly provide training, validation, and test splits (e.g., percentages or sample counts) for the datasets used in the experiments.
Hardware Specification | Yes | For the 1.3B and 6.7B models, all experiments are conducted on a single NVIDIA A100 GPU. For the 33B model, context-encoding time is measured on two NVIDIA A100 GPUs and the full generation process runs on eight NVIDIA A100 GPUs.
Software Dependencies | No | The paper states: "We use Transformers (Wolf et al., 2020) as our codebase." While it names this library, it does not pin a version for Transformers or list any other software dependencies with version numbers.
Experiment Setup | Yes | During inference, greedy decoding is used to deterministically generate 64 tokens. For the 1.3B and 6.7B models, all experiments are conducted on a single NVIDIA A100 GPU; for the 33B model, context-encoding time is measured on two NVIDIA A100 GPUs and the full generation process runs on eight NVIDIA A100 GPUs. The first non-comment line of the output is truncated and used as the prediction. The batch size is set to 1. All experiments are repeated three times with different seeds, and averaged scores are reported.
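Since the paper specifies PIE only through equations rather than pseudocode, the core idea — correcting the rotary positional phase of cached keys after an edit instead of recomputing the whole KV cache — can be sketched as below. This is a minimal, illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names (`rope_rotate`, `pie_update`), the array shapes, and the `base` frequency are hypothetical; the only substantive ingredient is the group property of rotary embeddings, R(m)·R(n) = R(m+n), which lets a cached key encoded at position p be re-encoded at p+Δ by one extra rotation of Δ.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the RoPE rotation for position `pos` to a vector of even dim."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D plane
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2-D rotation in each plane
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def pie_update(cached_keys, start, delta, base=10000.0):
    """Shift the rotary phase of cached keys from index `start` onward by
    `delta` positions (delta may be negative for deletions), instead of
    recomputing them from scratch. Uses R(p)·R(delta) = R(p + delta)."""
    keys = cached_keys.copy()
    for i in range(start, len(keys)):
        keys[i] = rope_rotate(keys[i], delta, base)
    return keys
```

The sketch makes the claimed saving concrete: after an edit that shifts later tokens by `delta`, only a cheap per-key rotation is applied to the affected cache entries, rather than a full forward pass over the edited context.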