Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We perform comprehensive evaluations comparing edit sequence code LMs against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to iteratively synthesize code match or outperform baselines on pass@1, and exhibit better scaling across higher pass@k as a function of total test-time FLOPs."
Researcher Affiliation: Academia. Ulyana Piterbarg, Lerrel Pinto, and Rob Fergus (New York University).
Pseudocode: Yes. "Pseudo-code as well as a visualization of each of these phases is provided in Figure 2."
Open Source Code: Yes. "We open-source our code and models at https://lintseq.github.io/."
Open Datasets: Yes. "To that end, we first pool the Python portions of two open-source instruction datasets for code synthesis: the GPT-3.5/4-based Magicoder instruction dataset and the StarCoder2-15B-based self-alignment training dataset (Wei et al., 2024b;a). ... Our pretraining data mix is inspired by Code Llama (Roziere et al., 2023), and reflects a code-skewed mixture of web text and raw Python sampled from FineWeb and The Stack, respectively (Penedo et al., 2024; Li et al., 2023)."
Dataset Splits: No. The paper's instruction fine-tuning dataset is created by pooling open-source datasets (Magicoder and the StarCoder2-based self-alignment dataset), but the paper does not explicitly provide training, validation, and test splits for this specific fine-tuning dataset. Instead, the fine-tuned models are evaluated on separate, established benchmarks such as HumanEval and MBPP(+), which have their own test sets.
Hardware Specification: Yes. "We train both models for two epochs with a batch size of 524,288 tokens on an NVIDIA H100 node with four GPUs."
Software Dependencies: No. The paper mentions several software components (PyTorch FSDP, pylint, Hugging Face, DeepSpeed, and difflib) but does not specify version numbers for any of them.
Experiment Setup: Yes. "We train both models for two epochs with a batch size of 524,288 tokens on an NVIDIA H100 node with four GPUs. Our experiments are supported by PyTorch FSDP (Zhao et al., 2023). ... Table 9: Architectural and pretraining hyperparameters of our on-device 150M and 400M parameter TinyCodeLM models, pretrained on a mixture of Web text and code for Python understanding. ... Table 13: All other instruction fine-tuning settings, re-used across experiments."
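The pass@1 and pass@k metrics cited in the Research Type row are conventionally computed with the unbiased estimator of Chen et al. (2021). The paper excerpt above does not reproduce the formula, so the sketch below shows the standard computation as an assumption, not an excerpt from the work:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n generations of
    which c are correct, passes the test suite."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 50 of them passing.
print(pass_at_k(200, 50, 1))   # 0.25
print(pass_at_k(200, 50, 10))  # larger k, higher chance of a pass
```

pass@1 reduces to the raw pass rate c/n, which is why it is the usual headline number; the higher-k values are where the paper reports better test-time scaling.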
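The difflib dependency flagged in the Software Dependencies row is what the paper uses to express intermediate program states as line diffs when generating synthetic edit sequences. A minimal sketch of that idea, assuming a unified-diff representation (the exact diff format the paper emits is not quoted here, and the example programs are hypothetical):

```python
import difflib

# Two successive program states in a hypothetical edit sequence.
before = [
    "def add(a, b):",
    "    return a + b",
]
after = [
    "def add(a, b):",
    '    """Add two numbers."""',
    "    return a + b",
]

# One edit in the sequence = one line diff between consecutive states.
edit = list(difflib.unified_diff(before, after, lineterm=""))
print("\n".join(edit))
```

A model trained on such sequences learns to emit the next diff given the current program, rather than regenerating the whole file from scratch.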