Let the Code LLM Edit Itself When You Edit the Code

Authors: Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Zhi Zhang, Di He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation covers three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full-recomputation approach across all model sizes and tasks while closely matching model performance.
Researcher Affiliation | Collaboration | National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; ByteDance Inc.
Pseudocode | No | The paper describes Positional Integrity Encoding (PIE) using mathematical equations (1)-(9) and descriptive text outlining the process, but it does not include a dedicated, structured pseudocode block or an explicitly labeled algorithm section.
Open Source Code | Yes | Code is available at https://github.com/zhenyuhe00/PIE.
Open Datasets | Yes | We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset (Liu et al., 2024a), utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. To further validate the effectiveness of PIE, we conduct experiments on code generation tasks, using HumanEval (Chen et al., 2021) and its C++ version from HumanEval-X (Zheng et al., 2023).
Dataset Splits | No | The paper reports experiments on the RepoBench-C-8k test set (Table 1) and describes how edited contexts for the insertion, deletion, and editing tasks are constructed by randomly modifying lines. However, it does not explicitly provide training, validation, and test splits (e.g., percentages or sample counts) for the datasets used in the experiments.
Hardware Specification | Yes | For the 1.3B and 6.7B models, all experiments are conducted on a single NVIDIA A100 GPU. For the 33B model, context-encoding time is measured on two NVIDIA A100 GPUs and the full generation process runs on eight NVIDIA A100 GPUs.
Software Dependencies | No | The paper states: "We use Transformers (Wolf et al., 2020) as our codebase." While it names this library, it does not pin a version for Transformers or list any other software dependencies with version numbers.
Experiment Setup | Yes | During inference, greedy decoding is used to deterministically generate 64 tokens. For the 1.3B and 6.7B models, all experiments are conducted on a single NVIDIA A100 GPU; for the 33B model, context-encoding time is measured on two NVIDIA A100 GPUs and the full generation process runs on eight NVIDIA A100 GPUs. The first non-comment line of the output is truncated and used as the prediction. The batch size is set to 1. All experiments are repeated three times with different seeds, and averaged scores are reported.
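Since the paper specifies PIE only through equations rather than pseudocode, the core idea — correcting the rotary positional phase of cached keys after an edit instead of recomputing the whole KV cache — can be sketched as below. This is a minimal, illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names (`rope_rotate`, `pie_update`), the array shapes, and the `base` frequency are hypothetical; the only substantive ingredient is the group property of rotary embeddings, R(m)·R(n) = R(m+n), which lets a cached key encoded at position p be re-encoded at p+Δ by one extra rotation of Δ.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the RoPE rotation for position `pos` to a vector of even dim."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D plane
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2-D rotation in each plane
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def pie_update(cached_keys, start, delta, base=10000.0):
    """Shift the rotary phase of cached keys from index `start` onward by
    `delta` positions (delta may be negative for deletions), instead of
    recomputing them from scratch. Uses R(p)·R(delta) = R(p + delta)."""
    keys = cached_keys.copy()
    for i in range(start, len(keys)):
        keys[i] = rope_rotate(keys[i], delta, base)
    return keys
```

The sketch makes the claimed saving concrete: after an edit that shifts later tokens by `delta`, only a cheap per-key rotation is applied to the affected cache entries, rather than a full forward pass over the edited context.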