Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive evaluations comparing edit sequence code LMs against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to iteratively synthesize code match or outperform baselines on pass@1, and exhibit better scaling across higher pass@k as a function of total test-time FLOPs. |
| Researcher Affiliation | Academia | Ulyana Piterbarg, Lerrel Pinto, & Rob Fergus New York University |
| Pseudocode | Yes | Pseudo-code as well as a visualization of each of these phases is provided in Figure 2. |
| Open Source Code | Yes | We open-source our code and models at https://lintseq.github.io/. |
| Open Datasets | Yes | To that end, we first pool the Python portions of two open-source instruction datasets for code synthesis: the GPT-3.5/4-based Magicoder instruction dataset and the StarCoder2-15B-based self-alignment training dataset (Wei et al., 2024b;a). ... Our pretraining data mix is inspired by Code Llama (Roziere et al., 2023), and reflects a code-skewed mixture of web text and raw Python sampled from FineWeb and The Stack, respectively (Penedo et al., 2024; Li et al., 2023). |
| Dataset Splits | No | The paper uses an instruction fine-tuning dataset created from pooled open-source datasets (Magicoder and StarCoder2) for supervised fine-tuning. However, it does not explicitly provide training, validation, and test splits for *this specific fine-tuning dataset*. Instead, it evaluates the fine-tuned models on separate, established benchmarks like HumanEval and MBPP(+), which have their own test sets. |
| Hardware Specification | Yes | We train both models for two epochs with a batch size of 524,288 tokens on an NVIDIA H100 node with four GPUs. |
| Software Dependencies | No | The paper mentions several software components like PyTorch FSDP, pylint, Hugging Face, DeepSpeed, and difflib, but it does not specify any version numbers for these components. |
| Experiment Setup | Yes | We train both models for two epochs with a batch size of 524,288 tokens on an NVIDIA H100 node with four GPUs. Our experiments are supported by PyTorch FSDP (Zhao et al., 2023). ... Table 9: Architectural and pretraining hyperparameters of our on-device 150M and 400M parameter TinyCodeLM models, pretrained on a mixture of web text and code for Python understanding. ... Table 13: All other instruction fine-tuning settings, re-used across experiments. |
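The Software Dependencies row notes that the paper relies on Python's standard-library `difflib` to express code edits as textual diffs. A minimal, illustrative sketch of computing a unified diff between two program states with `difflib` follows; the example program and comment are invented for illustration and are not taken from the paper's data:

```python
import difflib

# Two successive states of a toy program (illustrative, not from the paper).
old = "def add(a, b):\n    return a + b\n"
new = "def add(a, b):\n    # sum two values\n    return a + b\n"

# unified_diff consumes line sequences; keepends=True preserves newlines
# so the joined output is a well-formed unified diff.
diff = "".join(difflib.unified_diff(
    old.splitlines(keepends=True),
    new.splitlines(keepends=True),
))
print(diff)
```

Each inserted line appears in the diff prefixed with `+`, which is the kind of compact edit representation a model can be trained to emit one hunk at a time.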