Let the Code LLM Edit Itself When You Edit the Code
Authors: Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Zhi Zhang, Di He
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full-recomputation approach across all model sizes and tasks, while closely matching the performance of full recomputation. |
| Researcher Affiliation | Collaboration | National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; ByteDance Inc. |
| Pseudocode | No | The paper describes the Positional Integrity Encoding (PIE) using mathematical equations (1)-(9) and descriptive text outlining the process, but it does not include a dedicated, structured pseudocode block or an explicitly labeled algorithm section. |
| Open Source Code | Yes | Code is available at https://github.com/zhenyuhe00/PIE. |
| Open Datasets | Yes | We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our experiments are conducted on RepoBench-C-8k (Liu et al., 2024a). To further validate the effectiveness of PIE, we conduct experiments on code generation tasks, using HumanEval (Chen et al., 2021) and its C++ version from HumanEval-X (Zheng et al., 2023). |
| Dataset Splits | No | The paper mentions conducting experiments on the RepoBench-C-8k test set (Table 1) and describes the construction of edited contexts for specific tasks (insertion, deletion, editing) by randomly modifying lines. However, it does not explicitly provide the training, validation, and test splits (e.g., percentages or sample counts) for the overall datasets used in the experiments. |
| Hardware Specification | Yes | For the 1.3B and 6.7B models, all experiments are conducted on a single NVIDIA A100 GPU. For the 33B model, context encoding is timed on two NVIDIA A100 GPUs and the full generation process runs on eight NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper states: "We use Transformers (Wolf et al., 2020) as our codebase." While it mentions a software library, it does not provide a specific version number for the Transformers library or any other software dependencies with version numbers. |
| Experiment Setup | Yes | During inference, the greedy decoding strategy is used to deterministically generate 64 tokens. For the 1.3B and 6.7B models, all experiments are conducted on a single NVIDIA A100 GPU. For the 33B model, context encoding is timed on two NVIDIA A100 GPUs and the full generation process runs on eight NVIDIA A100 GPUs. The first non-comment line in the output is truncated and used as the prediction. The batch size is set to 1. All experiments are repeated three times with different seeds, and the averaged scores are reported. |
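The core mechanism behind PIE's >85% overhead reduction is reusing the KV cache across an edit: for a rotary-position-encoded (RoPE) model, keys cached for tokens after the edit point only differ from the correct keys by a positional phase shift, so the cache can be corrected with a cheap rotation instead of recomputing everything from hidden states. The sketch below is not the paper's exact equations (1)-(9); it is a minimal NumPy illustration of the underlying idea, assuming standard interleaved RoPE and hypothetical function names.

```python
import numpy as np

def rope_angles(delta, dim, base=10000.0):
    """Per-pair rotation angles for shifting a RoPE position by `delta`."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    return delta * inv_freq

def shift_cached_keys(cached_keys, delta, base=10000.0):
    """Correct cached RoPE keys after an edit moves them by `delta` positions.

    cached_keys: (seq, dim) keys already rotated for their *old* positions.
    RoPE rotations compose additively in angle, so applying the rotation
    for offset `delta` yields the keys for the *new* positions without
    touching the hidden states -- the positional-integrity correction.
    """
    theta = rope_angles(delta, cached_keys.shape[1], base)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = cached_keys[:, 0::2], cached_keys[:, 1::2]
    out = np.empty_like(cached_keys)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

For example, if an edit inserts 3 lines (say 40 tokens) above a cached suffix, `shift_cached_keys(keys, 40)` produces exactly the keys a full recomputation would assign at the new positions, because shifting by 3 and then by 2 is identical to shifting by 5.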