Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We perform comprehensive evaluations comparing edit sequence code LMs against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to iteratively synthesize code match or outperform baselines on pass@1, and exhibit better scaling across higher pass@k as a function of total test-time FLOPs."
Researcher Affiliation: Academia. Ulyana Piterbarg, Lerrel Pinto, and Rob Fergus (New York University).
Pseudocode: Yes. "Pseudo-code as well as a visualization of each of these phases is provided in Figure 2."
Open Source Code: Yes. "We open-source our code and models at https://lintseq.github.io/."
Open Datasets: Yes. "To that end, we first pool the Python portions of two open-source instruction datasets for code synthesis: the GPT-3.5/4-based Magicoder instruction dataset and the StarCoder2-15B-based self-alignment training dataset (Wei et al., 2024b;a). ... Our pretraining data mix is inspired by Code Llama (Roziere et al., 2023), and reflects a code-skewed mixture of web text and raw Python sampled from FineWeb and The Stack, respectively (Penedo et al., 2024; Li et al., 2023)."
Dataset Splits: No. The paper's instruction fine-tuning dataset is created by pooling open-source datasets (Magicoder and the StarCoder2-based self-alignment dataset), but the paper does not explicitly provide training, validation, and test splits for this specific fine-tuning dataset. Instead, the fine-tuned models are evaluated on separate, established benchmarks such as HumanEval and MBPP(+), which have their own test sets.
Hardware Specification: Yes. "We train both models for two epochs with a batch size of 524,288 tokens on an NVIDIA H100 node with four GPUs."
Software Dependencies: No. The paper mentions several software components (PyTorch FSDP, pylint, Hugging Face, DeepSpeed, and difflib) but does not specify version numbers for any of them.
Experiment Setup: Yes. "We train both models for two epochs with a batch size of 524,288 tokens on an NVIDIA H100 node with four GPUs. Our experiments are supported by PyTorch FSDP (Zhao et al., 2023). ... Table 9: Architectural and pretraining hyperparameters of our on-device 150M and 400M parameter TinyCodeLM models, pretrained on a mixture of Web text and code for Python understanding. ... Table 13: All other instruction fine-tuning settings, re-used across experiments."
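The pass@1 and pass@k metrics cited in the Research Type row are conventionally computed with the unbiased estimator of Chen et al. (2021). The paper excerpt above does not reproduce the formula, so the sketch below shows the standard computation as an assumption, not an excerpt from the work:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n generations of
    which c are correct, passes the test suite."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 50 of them passing.
print(pass_at_k(200, 50, 1))   # 0.25
print(pass_at_k(200, 50, 10))  # larger k, higher chance of a pass
```

pass@1 reduces to the raw pass rate c/n, which is why it is the usual headline number; the higher-k values are where the paper reports better test-time scaling.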
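The difflib dependency flagged in the Software Dependencies row is what the paper uses to express intermediate program states as line diffs when generating synthetic edit sequences. A minimal sketch of that idea, assuming a unified-diff representation (the exact diff format the paper emits is not quoted here, and the example programs are hypothetical):

```python
import difflib

# Two successive program states in a hypothetical edit sequence.
before = [
    "def add(a, b):",
    "    return a + b",
]
after = [
    "def add(a, b):",
    '    """Add two numbers."""',
    "    return a + b",
]

# One edit in the sequence = one line diff between consecutive states.
edit = list(difflib.unified_diff(before, after, lineterm=""))
print("\n".join(edit))
```

A model trained on such sequences learns to emit the next diff given the current program, rather than regenerating the whole file from scratch.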