NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits

Authors: Tushar Aggarwal, Swayam Singh, Abhijeet Awasthi, Aditya Kanade, Nagarajan Natarajan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using our approach, we obtain a new series of models, NextCoder (adapted from Qwen2.5-Coder), that achieves strong results on five code-editing benchmarks, outperforming comparable-size models and even several larger ones. We show the generality of our approach on two model families (DeepSeekCoder and Qwen2.5-Coder), compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation and general problem-solving abilities post adaptation.
Researcher Affiliation | Industry | Microsoft Research India. Correspondence to: Tushar Aggarwal <EMAIL>, Swayam Singh <EMAIL>, Abhijeet Awasthi <EMAIL>, Aditya Kanade <EMAIL>, Nagarajan Natarajan <EMAIL>.
Pseudocode | Yes |
Algorithm 1 SeleKT: Selective Knowledge Transfer
Require: Base LM weights θ_base, training data D, epochs E, periodicity M, sparsity α.
Ensure: Final fine-tuned weights θ_FT.
1: Initialize θ ← θ_base.
2: for epoch e = 1 to E do
3:   for each minibatch D[s] do
4:     θ ← TrainStep(θ, D[s])  [Dense Gradients]
5:     if s mod M = 0 then
6:       Compute task vector: τ ← θ − θ_base
7:       Select top-αN parameters: γ_i ← 1 if i ∈ top-k(|τ|, αN), 0 otherwise
8:       θ ← θ_base + γ ⊙ τ  [Sparse Projection]
9:     end if
10:   end for
11: end for
12: return θ as θ_FT.
Open Source Code | Yes | We open-source the models, synthetic dataset, and implementation at aka.ms/nextcoder.
Open Datasets | Yes | We open-source the models, synthetic dataset, and implementation at aka.ms/nextcoder.
Dataset Splits | No | In addition to the synthetic data (Table 1), we used 127K instances from CommitPackFT to fine-tune our models. The paper does not specify explicit training/validation/test splits for this combined dataset.
Hardware Specification | Yes | For fine-tuning and inference, we use 8 NVIDIA H100 GPUs, each with 80GB of VRAM. For data generation using GPT-4o (version 2024-05-13), we use the OpenAI API. Following Singhal et al. (2024), we perform run-time evaluations for NoFunEval on an Azure NC16 VM.
Software Dependencies | No | The paper mentions using the AdamW optimizer, a warmup LR scheduler, DeepSpeed (Rajbhandari et al., 2020), and bfloat16 for memory optimizations, but does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | We fine-tune for 3 epochs, across all our experiments, using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 10^-5, and a warmup LR scheduler (Kim et al., 2021) with a warmup ratio of 0.1. For efficient memory management, we used sample packing with a maximum sequence length of 8192 tokens for DeepSeekCoder-6.7B and 16384 tokens for Qwen2.5-Coder variants, with batch sizes of 4 and 1 per GPU, respectively. Gradient accumulation steps were set to 4, resulting in respective effective batch sizes of 64 and 32. We fix the periodicity to 1 epoch in the SeleKT algorithm unless specified otherwise, i.e., M = total number of mini-batches. We set sparsity α = 0.05 per layer.
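The sparse-projection step of Algorithm 1 (lines 6-8: task vector, top-αN mask, projection back onto the base weights) can be sketched as follows. This is a minimal NumPy illustration, not the released implementation; the function name `selekt_project` and the dict-of-arrays weight layout are assumptions, and ties at the magnitude threshold may keep slightly more than αN entries per layer.

```python
import numpy as np

def selekt_project(theta: dict, theta_base: dict, alpha: float) -> dict:
    """Project fine-tuned weights back onto the base model, keeping only the
    top-alpha fraction of each layer's task vector (per-layer sparsity)."""
    projected = {}
    for name, w_base in theta_base.items():
        tau = theta[name] - w_base              # task vector for this layer
        k = max(1, int(alpha * tau.size))       # number of entries to keep
        # Threshold = k-th largest |tau| entry; mask keeps entries at/above it
        threshold = np.sort(np.abs(tau).ravel())[-k]
        gamma = (np.abs(tau) >= threshold).astype(tau.dtype)
        projected[name] = w_base + gamma * tau  # sparse projection (line 8)
    return projected
```

For example, with `alpha = 0.5` on a 4-entry layer, only the two largest-magnitude deltas survive; all other parameters snap back to their base values.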
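The reported learning-rate setup (lr 10^-5 with a warmup ratio of 0.1) can be sketched as a schedule function. This is an illustrative sketch only: the step count is a placeholder, and the constant rate after warmup is an assumption, since the paper does not quote the scheduler's post-warmup shape.

```python
BASE_LR = 1e-5                   # reported learning rate
EPOCHS = 3                       # reported number of fine-tuning epochs
STEPS_PER_EPOCH = 100            # placeholder; also M, the SeleKT periodicity
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)  # warmup ratio 0.1

def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR, then constant (an assumed warmup shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    return BASE_LR
```

With these placeholder numbers, the rate ramps linearly over the first 30 of 300 steps and then holds at 10^-5 for the rest of training.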