EditLord: Learning Code Transformation Rules for Code Editing

Authors: Weichen Li, Albert Jan, Baishakhi Ray, Junfeng Yang, Chengzhi Mao, Kexin Pei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate EDITLORD on the three critical software engineering and security code editing tasks... EDITLORD outperforms the state-of-the-art code editing techniques by 23.3%, 12.7%, and 27.6%, respectively, across multiple code LMs. ... We compare EDITLORD to zero-shot prompting, chain-of-thought prompting (CoT), and finetuning (state-of-the-art baselines) on three tasks: performance optimization, decompilation, and security hardening. As illustrated in Table 1, EDITLORD outperforms the state-of-the-art baselines across all tasks and models...
Researcher Affiliation | Academia | Weichen Li (The University of Chicago), Albert Jan (Columbia University), Baishakhi Ray (Columbia University), Junfeng Yang (Columbia University), Chengzhi Mao (Rutgers University), Kexin Pei (The University of Chicago).
Pseudocode | Yes | Algorithm 1: Iterative Meta-Rule Set Refinement
Open Source Code | No | The paper mentions using "open-source DeepSeek-Coder 1.3B and 6.7B", but these are models used by the authors, not code released by the authors for their methodology. There is no explicit statement or link providing access to the source code for EDITLORD's implementation.
Open Datasets | Yes | We use the HQ (high quality) dataset from Shypula et al. (2024) for training and evaluation. ... We obtain these decompiled code samples using the off-the-shelf decompiler Ghidra (Agency, 2019). ... We follow Tan et al. (2024) to randomly sample original code snippets from AnghaBench (da Silva et al., 2021) ... We obtain the vulnerable and secure code pairs from SVEN (He & Vechev, 2023) for training and validation, and evaluate EDITLORD on a strictly unseen testing set, CWEval (Peng et al., 2025a).
Dataset Splits | Yes | The dataset consists of 4,085 training (slow and fast code) pairs, 2,544 validation samples, and 978 testing samples. ... Overall, our dataset consists of 8,567 (machine-decompiled code, original source code) training code pairs, 834 validation samples, and 131 testing samples with test cases.
Hardware Specification | Yes | We choose models that can be full-parameter finetuned on our local hardware (2x4 NVIDIA L40S GPUs).
Software Dependencies | No | The paper mentions using 'DeepSeek-Coder 1.3B and 6.7B' and 'GPT-4o mini' as language models, but does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch, TensorFlow) used for their implementation or training.
Experiment Setup | Yes | To finetune DeepSeek-Coder, we use a default batch size of 32, a learning rate of 1e-5, and a context length of 4,000 tokens for both the input and output. The models are optimized using AdamW and trained for a fixed number of 10 epochs, and we use the model checkpoint that achieves the best validation loss for inference. To finetune GPT-4o mini, we train for only one epoch. At the inference stage, we set the temperature to 0.7 and use the model's default window size, i.e., 16K for DeepSeek-Coder and 128K for GPT-4o mini.
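Since the paper does not release code, the reported experiment setup can be collected into a small configuration sketch for anyone attempting a reproduction. The class and field names below are illustrative assumptions; only the values come from the quoted text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditLordFinetuneConfig:
    """Hyperparameters as reported in the paper (field names are our own)."""
    batch_size: int = 32              # default batch size
    learning_rate: float = 1e-5       # AdamW learning rate
    optimizer: str = "AdamW"
    max_input_tokens: int = 4_000     # context length for input tokens
    max_output_tokens: int = 4_000    # context length for output tokens
    epochs: int = 10                  # best validation-loss checkpoint kept
    inference_temperature: float = 0.7

# Reported deviation: GPT-4o mini is finetuned for a single epoch.
deepseek_cfg = EditLordFinetuneConfig()
gpt4o_mini_cfg = EditLordFinetuneConfig(epochs=1)
```

Note that the inference context windows (16K for DeepSeek-Coder, 128K for GPT-4o mini) are the models' defaults and are not part of the training configuration above.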