EditLord: Learning Code Transformation Rules for Code Editing

Authors: Weichen Li, Albert Jan, Baishakhi Ray, Junfeng Yang, Chengzhi Mao, Kexin Pei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate EDITLORD on the three critical software engineering and security code editing tasks... EDITLORD outperforms the state-of-the-art code editing techniques by 23.3%, 12.7%, and 27.6%, respectively, across multiple code LMs. ... We compare EDITLORD to zero-shot prompting, chain-of-thought prompting (CoT), and finetuning (state-of-the-art baselines) on three tasks: performance optimization, decompilation, and security hardening. As illustrated in Table 1, EDITLORD outperforms the state-of-the-art baselines across all tasks and models...
Researcher Affiliation | Academia | Weichen Li (The University of Chicago), Albert Jan (Columbia University), Baishakhi Ray (Columbia University), Junfeng Yang (Columbia University), Chengzhi Mao (Rutgers University), Kexin Pei (The University of Chicago).
Pseudocode | Yes | Algorithm 1: Iterative Meta-Rule Set Refinement
Open Source Code | No | The paper mentions using "open-source DeepSeek-Coder 1.3B and 6.7B", but these are models used by the authors, not code released by the authors for their methodology. There is no explicit statement or link providing access to the source code for EDITLORD's implementation.
Open Datasets | Yes | We use the HQ (high quality) dataset from Shypula et al. (2024) for training and evaluation. ... We obtain these decompiled code samples using the off-the-shelf decompiler Ghidra (Agency, 2019). ... We follow Tan et al. (2024) to randomly sample original code snippets from AnghaBench (da Silva et al., 2021) ... We obtain the vulnerable and secure code pairs from SVEN (He & Vechev, 2023) for training and validation, and evaluate EDITLORD on a strictly unseen testing set, CWEval (Peng et al., 2025a).
Dataset Splits | Yes | The dataset consists of 4,085 training (slow and fast code) pairs, 2,544 validation samples, and 978 testing samples. ... Overall, our dataset consists of 8,567 (machine-decompiled code, original source code) training code pairs, 834 validation samples, and 131 testing samples with test cases.
Hardware Specification | Yes | We choose models that can be full-parameter finetuned on our local hardware (2x4 NVIDIA L40S GPUs).
Software Dependencies | No | The paper mentions using 'DeepSeek-Coder 1.3B and 6.7B' and 'GPT-4o mini' as language models, but does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch, TensorFlow) used for their implementation or training.
Experiment Setup | Yes | To finetune DeepSeek-Coder, we use a default batch size of 32, a learning rate of 1e-5, and a context length of 4,000 tokens for both the input and output. The models are optimized using AdamW and trained for a fixed number of 10 epochs, and we use the model checkpoint that achieves the best validation loss for inference. To finetune GPT-4o mini, we train for only one epoch. At the inference stage, we set the temperature to 0.7 and use the model's default window size, i.e., 16K for DeepSeek-Coder and 128K for GPT-4o mini.
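Since the paper does not release code, the reported experiment setup can be collected into a small configuration sketch for anyone attempting a reproduction. The class and field names below are illustrative assumptions; only the values come from the quoted text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditLordFinetuneConfig:
    """Hyperparameters as reported in the paper (field names are our own)."""
    batch_size: int = 32              # default batch size
    learning_rate: float = 1e-5       # AdamW learning rate
    optimizer: str = "AdamW"
    max_input_tokens: int = 4_000     # context length for input tokens
    max_output_tokens: int = 4_000    # context length for output tokens
    epochs: int = 10                  # best validation-loss checkpoint kept
    inference_temperature: float = 0.7

# Reported deviation: GPT-4o mini is finetuned for a single epoch.
deepseek_cfg = EditLordFinetuneConfig()
gpt4o_mini_cfg = EditLordFinetuneConfig(epochs=1)
```

Note that the inference context windows (16K for DeepSeek-Coder, 128K for GPT-4o mini) are the models' defaults and are not part of the training configuration above.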