CodeSync: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Authors: Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Hai Jin, Dongping Chen

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on 14 LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). Our CODESYNC lays a strong foundation for developing more effective and robust methods for real-time and large-scale code knowledge updating in the future. The experimental code is available at: https://github.com/CGCL-codes/naturalcc/tree/main/examples/codesync.
Researcher Affiliation Academia 1National Engineering Research Center for Big Data Technology and Systems, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China 2Wuhan University 3Zhejiang University. Correspondence to: Yao Wan <EMAIL>.
Pseudocode No The paper describes a framework and its steps in detail (e.g., Section 2.1, 2.2, 2.3), and uses flow diagrams like Figure 3, but does not contain explicit pseudocode blocks or algorithms with structured steps.
Open Source Code Yes The experimental code is available at: https://github.com/CGCL-codes/naturalcc/tree/main/examples/codesync.
Open Datasets Yes Based on CODESYNC, we develop CODESYNCBENCH, an extensive benchmark for assessing LLMs' ability to stay synchronized with dynamic code evolution, which includes real-world updates for 220 APIs (130 functions, 59 initializers, and 31 methods) from 6 Python libraries, along with 3,300 legacy-updated pairs of API invocation instances. The experimental code is available at: https://github.com/CGCL-codes/naturalcc/tree/main/examples/codesync.
Dataset Splits Yes Each API is associated with 15 legacy-updated invocation pairs (3,300 in total), with 5 pairs for evaluation (1,100 in total) and 10 for training (2,200 in total). Based on this, our benchmark builds 1,100 tests per evaluation task, accompanied by a training set comprising 2,200 update-aware instructions
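The split figures quoted above are internally consistent, which a short arithmetic check makes explicit (numbers are taken from the quote; the constant names are ours, not the paper's):

```python
# Sanity-check the CODESYNCBENCH split figures quoted above.
NUM_APIS = 220        # 130 functions + 59 initializers + 31 methods
PAIRS_PER_API = 15    # legacy-updated invocation pairs per API
EVAL_PER_API = 5      # pairs held out for evaluation
TRAIN_PER_API = 10    # pairs used for training

assert 130 + 59 + 31 == NUM_APIS
assert NUM_APIS * PAIRS_PER_API == 3300   # total pairs
assert NUM_APIS * EVAL_PER_API == 1100    # evaluation pairs / tests
assert NUM_APIS * TRAIN_PER_API == 2200   # training instructions
```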
Hardware Specification Yes We use LoRA for all instruction tuning experiments (?) based on LoRA SFT on A800 servers.
Software Dependencies No The paper mentions using "Python's built-in inspect module", "Python's built-in ast module", and "LLaMA-Factory (Zheng et al., 2024b)" but does not provide specific version numbers for these software components.
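For context on the kind of analysis these two stdlib modules enable: inspect can read a callable's current signature, and ast can recover which APIs a code snippet invokes. The sketch below is our illustration of that idea, not the paper's code; the helper names are hypothetical.

```python
import ast
import inspect
import json


def signature_of(obj):
    """Return a printable signature for a callable via inspect."""
    return str(inspect.signature(obj))


def find_call_names(source):
    """Collect the names of callables invoked in a source snippet via ast."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):        # e.g. foo(...)
                names.append(func.id)
            elif isinstance(func, ast.Attribute):  # e.g. json.dumps(...)
                names.append(func.attr)
    return names


# Example: inspect a stdlib function, then scan a snippet that calls it.
print(signature_of(json.dumps))                  # keyword defaults such as skipkeys
print(find_call_names("json.dumps({'a': 1})"))   # ['dumps']
```

Outputs of inspect and ast depend on the interpreter version, which is exactly why the missing version pins matter for reproducibility.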
Experiment Setup Yes Table 7: RQ2. Hyperparameters for Qwen2.5-7B-Instruct.

Techniques  Epoch  Learning Rate  Warmup Ratio  Preference Beta
SFT         3      1.0e-4         0.1           -
SFT (LoRA)  3      1.0e-4         0.1           -
DPO         3.5    5.0e-6         0.1           0.1
ORPO        3.5    5.0e-6         0.1           0.1
SimPO       3.5    5.0e-6         0.1           0.1
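Restated as plain data, Table 7 makes the pattern easy to see: the three preference-optimization methods share one schedule, and only the SFT variants differ (this is our restatement of the table, not the authors' configuration files):

```python
# Table 7 hyperparameters restated as structured data.
HYPERPARAMS = {
    "SFT":        {"epoch": 3,   "lr": 1.0e-4, "warmup_ratio": 0.1, "beta": None},
    "SFT (LoRA)": {"epoch": 3,   "lr": 1.0e-4, "warmup_ratio": 0.1, "beta": None},
    "DPO":        {"epoch": 3.5, "lr": 5.0e-6, "warmup_ratio": 0.1, "beta": 0.1},
    "ORPO":       {"epoch": 3.5, "lr": 5.0e-6, "warmup_ratio": 0.1, "beta": 0.1},
    "SimPO":      {"epoch": 3.5, "lr": 5.0e-6, "warmup_ratio": 0.1, "beta": 0.1},
}

# The preference methods (those with a beta) all use the same schedule.
preference = {k: v for k, v in HYPERPARAMS.items() if v["beta"] is not None}
assert len(preference) == 3
assert all(v["lr"] == 5.0e-6 and v["epoch"] == 3.5 for v in preference.values())
```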