CodeSync: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Authors: Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Hai Jin, Dongping Chen

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on 14 LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). Our CODESYNC lays a strong foundation for developing more effective and robust methods for real-time and large-scale code knowledge updating in the future. The experimental code is available at: https://github.com/CGCL-codes/naturalcc/tree/main/examples/codesync.
Researcher Affiliation Academia 1National Engineering Research Center for Big Data Technology and Systems, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China 2Wuhan University 3Zhejiang University. Correspondence to: Yao Wan <EMAIL>.
Pseudocode No The paper describes a framework and its steps in detail (e.g., Section 2.1, 2.2, 2.3), and uses flow diagrams like Figure 3, but does not contain explicit pseudocode blocks or algorithms with structured steps.
Open Source Code Yes The experimental code is available at: https://github.com/CGCL-codes/naturalcc/tree/main/examples/codesync.
Open Datasets Yes Based on CODESYNC, we develop CODESYNCBENCH, an extensive benchmark for assessing LLMs' ability to stay synchronized with dynamic code evolution, which includes real-world updates for 220 APIs (130 functions, 59 initializers, and 31 methods) from 6 Python libraries, along with 3,300 legacy-updated pairs of API invocation instances. The experimental code is available at: https://github.com/CGCL-codes/naturalcc/tree/main/examples/codesync.
Dataset Splits Yes Each API is associated with 15 legacy-updated invocation pairs (3,300 in total), with 5 pairs for evaluation (1,100 in total) and 10 for training (2,200 in total). Based on this, our benchmark builds 1,100 tests per evaluation task, accompanied by a training set comprising 2,200 update-aware instructions
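The split figures quoted above are internally consistent, which a short arithmetic check makes explicit (numbers are taken from the quote; the constant names are ours, not the paper's):

```python
# Sanity-check the CODESYNCBENCH split figures quoted above.
NUM_APIS = 220        # 130 functions + 59 initializers + 31 methods
PAIRS_PER_API = 15    # legacy-updated invocation pairs per API
EVAL_PER_API = 5      # pairs held out for evaluation
TRAIN_PER_API = 10    # pairs used for training

assert 130 + 59 + 31 == NUM_APIS
assert NUM_APIS * PAIRS_PER_API == 3300   # total pairs
assert NUM_APIS * EVAL_PER_API == 1100    # evaluation pairs / tests
assert NUM_APIS * TRAIN_PER_API == 2200   # training instructions
```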
Hardware Specification Yes We use LoRA for all instruction tuning experiments (?) based on LoRA SFT on A800 servers.
Software Dependencies No The paper mentions using "Python's built-in inspect module", "Python's built-in ast module", and "LLaMA-Factory (Zheng et al., 2024b)" but does not provide specific version numbers for these software components.
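For context on the kind of analysis these two stdlib modules enable: inspect can read a callable's current signature, and ast can recover which APIs a code snippet invokes. The sketch below is our illustration of that idea, not the paper's code; the helper names are hypothetical.

```python
import ast
import inspect
import json


def signature_of(obj):
    """Return a printable signature for a callable via inspect."""
    return str(inspect.signature(obj))


def find_call_names(source):
    """Collect the names of callables invoked in a source snippet via ast."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):        # e.g. foo(...)
                names.append(func.id)
            elif isinstance(func, ast.Attribute):  # e.g. json.dumps(...)
                names.append(func.attr)
    return names


# Example: inspect a stdlib function, then scan a snippet that calls it.
print(signature_of(json.dumps))                  # keyword defaults such as skipkeys
print(find_call_names("json.dumps({'a': 1})"))   # ['dumps']
```

Outputs of inspect and ast depend on the interpreter version, which is exactly why the missing version pins matter for reproducibility.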
Experiment Setup Yes Table 7: RQ2. Hyperparameters for Qwen2.5-7B-Instruct.

Techniques  Epoch  Learning Rate  Warmup Ratio  Preference Beta
SFT         3      1.0e-4         0.1           -
SFT (LoRA)  3      1.0e-4         0.1           -
DPO         3.5    5.0e-6         0.1           0.1
ORPO        3.5    5.0e-6         0.1           0.1
SimPO       3.5    5.0e-6         0.1           0.1
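Restated as plain data, Table 7 makes the pattern easy to see: the three preference-optimization methods share one schedule, and only the SFT variants differ (this is our restatement of the table, not the authors' configuration files):

```python
# Table 7 hyperparameters restated as structured data.
HYPERPARAMS = {
    "SFT":        {"epoch": 3,   "lr": 1.0e-4, "warmup_ratio": 0.1, "beta": None},
    "SFT (LoRA)": {"epoch": 3,   "lr": 1.0e-4, "warmup_ratio": 0.1, "beta": None},
    "DPO":        {"epoch": 3.5, "lr": 5.0e-6, "warmup_ratio": 0.1, "beta": 0.1},
    "ORPO":       {"epoch": 3.5, "lr": 5.0e-6, "warmup_ratio": 0.1, "beta": 0.1},
    "SimPO":      {"epoch": 3.5, "lr": 5.0e-6, "warmup_ratio": 0.1, "beta": 0.1},
}

# The preference methods (those with a beta) all use the same schedule.
preference = {k: v for k, v in HYPERPARAMS.items() if v["beta"] is not None}
assert len(preference) == 3
assert all(v["lr"] == 5.0e-6 and v["epoch"] == 3.5 for v in preference.values())
```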