NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits

Authors: Tushar Aggarwal, Swayam Singh, Abhijeet Awasthi, Aditya Kanade, Nagarajan Natarajan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using our approach, we obtain a new series of models, NextCoder (adapted from Qwen2.5-Coder), that achieves strong results on five code-editing benchmarks, outperforming comparable-size models and even several larger ones. We show the generality of our approach on two model families (DeepSeekCoder and Qwen2.5-Coder), compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation and general problem-solving abilities post adaptation.
Researcher Affiliation | Industry | Microsoft Research India. Correspondence to: Tushar Aggarwal <EMAIL>, Swayam Singh <EMAIL>, Abhijeet Awasthi <EMAIL>, Aditya Kanade <EMAIL>, Nagarajan Natarajan <EMAIL>.
Pseudocode | Yes |
Algorithm 1 SeleKT: Selective Knowledge Transfer
Require: Base LM weights θ_base, training data D, epochs E, periodicity M, sparsity α.
Ensure: Final fine-tuned weights θ_FT.
1: Initialize θ ← θ_base.
2: for epoch e = 1 to E do
3:   for each minibatch D[s] do
4:     θ ← TrainStep(θ, D[s])  [Dense Gradients]
5:     if s mod M = 0 then
6:       Compute task vector: τ ← θ − θ_base
7:       Select top-αN parameters: γ_i ← 1 if i ∈ top-k(|τ|, αN), 0 otherwise
8:       θ ← θ_base + γ ⊙ τ  [Sparse Projection]
9:     end if
10:   end for
11: end for
12: return θ as θ_FT.
Open Source Code | Yes | We open-source the models, synthetic dataset, and implementation at aka.ms/nextcoder.
Open Datasets | Yes | We open-source the models, synthetic dataset, and implementation at aka.ms/nextcoder.
Dataset Splits | No | In addition to the synthetic data (Table 1), we used 127K instances from CommitPackFT to fine-tune our models. The paper does not specify explicit training/validation/test splits for this combined dataset.
Hardware Specification | Yes | For fine-tuning and inference, we use 8 NVIDIA H100 GPUs, each with 80GB of VRAM. For data generation using GPT-4o (version 2024-05-13), we use the OpenAI API. Following Singhal et al. (2024), we perform run-time evaluations for NoFunEval on an Azure NC16 VM.
Software Dependencies | No | The paper mentions using the AdamW optimizer, a warmup LR scheduler, DeepSpeed (Rajbhandari et al., 2020), and bfloat16 for memory optimizations, but does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | We fine-tune for 3 epochs, across all our experiments, using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 10^-5, and a warmup LR scheduler (Kim et al., 2021) with a warmup ratio of 0.1. For efficient memory management, we used sample packing with a maximum sequence length of 8192 tokens for DeepSeekCoder-6.7B and 16384 tokens for Qwen2.5-Coder variants, with batch sizes of 4 and 1 per GPU, respectively. Gradient accumulation steps were set to 4, resulting in respective effective batch sizes of 64 and 32. We fix the periodicity to 1 epoch in the SeleKT algorithm unless specified otherwise, i.e., M = total number of mini-batches. We set sparsity α = 0.05 per layer.
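The sparse-projection step of Algorithm 1 (lines 6-8: task vector, top-αN mask, projection back onto the base weights) can be sketched as follows. This is a minimal NumPy illustration, not the released implementation; the function name `selekt_project` and the dict-of-arrays weight layout are assumptions, and ties at the magnitude threshold may keep slightly more than αN entries per layer.

```python
import numpy as np

def selekt_project(theta: dict, theta_base: dict, alpha: float) -> dict:
    """Project fine-tuned weights back onto the base model, keeping only the
    top-alpha fraction of each layer's task vector (per-layer sparsity)."""
    projected = {}
    for name, w_base in theta_base.items():
        tau = theta[name] - w_base              # task vector for this layer
        k = max(1, int(alpha * tau.size))       # number of entries to keep
        # Threshold = k-th largest |tau| entry; mask keeps entries at/above it
        threshold = np.sort(np.abs(tau).ravel())[-k]
        gamma = (np.abs(tau) >= threshold).astype(tau.dtype)
        projected[name] = w_base + gamma * tau  # sparse projection (line 8)
    return projected
```

For example, with `alpha = 0.5` on a 4-entry layer, only the two largest-magnitude deltas survive; all other parameters snap back to their base values.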
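The reported learning-rate setup (lr 10^-5 with a warmup ratio of 0.1) can be sketched as a schedule function. This is an illustrative sketch only: the step count is a placeholder, and the constant rate after warmup is an assumption, since the paper does not quote the scheduler's post-warmup shape.

```python
BASE_LR = 1e-5                   # reported learning rate
EPOCHS = 3                       # reported number of fine-tuning epochs
STEPS_PER_EPOCH = 100            # placeholder; also M, the SeleKT periodicity
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)  # warmup ratio 0.1

def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR, then constant (an assumed warmup shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    return BASE_LR
```

With these placeholder numbers, the rate ramps linearly over the first 30 of 300 steps and then holds at 10^-5 for the rest of training.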