How Does Code Pretraining Affect Language Model Task Performance?

Authors: Jackson Petty, Sjoerd van Steenkiste, Tal Linzen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we do just this. We pretrain language models on datasets which interleave natural language and code in two different settings: competitive, in which the total volume of data seen during pretraining is held constant; and additive, in which the volume of language data is held constant. We study how the pretraining mixture affects performance on (a) compositionality, measured by generalization accuracy on semantic parsing and syntactic transformation tasks, and more broadly on (b) downstream non-code-related objectives, measured by performance on tasks from the BIG-bench benchmark. We find that pretraining on higher proportions of code improves performance on compositional tasks involving structured output (like semantic parsing), and mathematics. Conversely, increased code mixture can harm performance on other tasks, including on tasks that require sensitivity to linguistic structure such as syntax or morphology, and tasks measuring real-world knowledge.
Researcher Affiliation | Collaboration | Jackson Petty (EMAIL), Department of Linguistics, New York University; Sjoerd van Steenkiste (EMAIL), Google Research; Tal Linzen (EMAIL), Google Research
Pseudocode | No | The paper describes its methodology and experimental setup in detail but does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described in narrative text.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide any links to a code repository for the methodology described. It refers to open-source models like Llama 2 & Code Llama and Gemma & Code Gemma, but these are external models used for comparison, not the authors' own implementation.
Open Datasets | Yes | The ingredients for our datasets are the English portion of the Colossal Cleaned Common Crawl (C4; Raffel et al., 2020) and a version of the code portion of The Pile (Gao et al., 2020), itself taken from public GitHub repositories; we use a version which has been cleaned to include only non-binary files smaller than 1MB with common code-related file extensions. To evaluate whether increased code mixture enables compositional generalization, we finetune our pretrained models on a suite of compositional generalization datasets: COGS (Kim & Linzen, 2020), a semantic parsing task in which natural-language sentences are transformed into a formal semantic representation; COGS-vf (Qiu et al., 2022), a variant of COGS which simplifies the output format; and English Passivization (Mueller et al., 2022), a natural-language transduction task in which synthetically-generated active-voice sentences are transformed into passive variants. We also evaluate models on BIG-bench (Srivastava et al., 2023), a benchmark of 204 diverse and challenging tasks presented in a common format.
Dataset Splits | Yes | Each dataset contains training, validation, and generalization splits, where the generalization split is constructed to test licit-but-unattested combinations of familiar primitives. COGS and COGS-vf both divide their generalization split into two parts based on generalization type: either lexical... or structural...
Hardware Specification | Yes | We pretrain models on TPUs. We estimate that full replication of the pretraining procedure outlined here would take roughly 750 TPU-days of compute.
Software Dependencies | No | We construct 12-layer decoder-only models in t5x (Roberts et al., 2023). Model hyperparameters were chosen following the methodology of Wang et al. (2022) and Petty et al. (2024) to approximate decoder-only versions of T5-large...
Experiment Setup | Yes | We construct 12-layer decoder-only models in t5x (Roberts et al., 2023). Model hyperparameters were chosen following the methodology of Wang et al. (2022) and Petty et al. (2024) to approximate decoder-only versions of T5-large, resulting in models with roughly 374M parameters; see Appendix A for hyperparameter details. We pretrain these models with a base natural language data volume of 132B tokens. This means that all models in the competitive setting were trained with Ntotal = 132B tokens, while the models in the additive setting were trained with Nlang = 132B tokens, and hence Ntotal varying between 132B tokens and 264B tokens depending on the mixture; we use a batch size of 128, meaning that models were trained for between 1M and 2M steps, depending on the mixture and setting. For each combination of code mixture and setting, we pretrain models from five different random seeds. In Appendix A: We use the baseline 374M-parameter model configuration from Petty et al. (2024) for our experiments, which has n_layers = 24, d_ff = 2816, d_model = d_attention = 1024, and n_heads = 64. For all compositional generalization datasets, we finetune models for 10K steps.
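The competitive and additive settings quoted above amount to simple token-budget arithmetic, which can be sketched as follows. This is an illustration only: the function name, the return structure, and the choice to parameterize the mixture as a fraction of total tokens are assumptions, not from the paper.

```python
def mixture_tokens(code_frac, n_lang_base=132e9, setting="competitive"):
    """Token budgets for a given code-mixture fraction (illustrative sketch).

    competitive: total volume is fixed at the base budget, so code displaces
                 natural-language data.
    additive:    language volume is fixed at the base budget, so code is added
                 on top and the total grows.
    """
    if setting == "competitive":
        total = n_lang_base
        lang = (1 - code_frac) * total
    elif setting == "additive":
        lang = n_lang_base
        total = lang / (1 - code_frac)
    else:
        raise ValueError(f"unknown setting: {setting}")
    code = total - lang
    return {"lang": lang, "code": code, "total": total}

# At a 50% code mixture, the additive total doubles the 132B-token base
# budget to 264B tokens, matching the range reported in the paper.
print(mixture_tokens(0.5, setting="additive")["total"])
```

Under this parameterization, the paper's reported range of Ntotal (132B to 264B tokens in the additive setting) corresponds to code fractions between 0 and 0.5.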
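The reported step counts can be sanity-checked from the token budgets and the batch size of 128. The quoted excerpt does not state the sequence length, so the value of 1024 tokens below is an assumption chosen because it makes the reported numbers line up.

```python
BATCH_SIZE = 128
SEQ_LEN = 1024  # assumption: not stated in the quoted excerpt

def training_steps(n_tokens):
    """Steps needed to consume n_tokens at BATCH_SIZE sequences per step."""
    return n_tokens / (BATCH_SIZE * SEQ_LEN)

# 132B tokens (competitive / additive base) comes out to roughly 1M steps,
# and 264B tokens (additive, 50% code) to roughly 2M steps, consistent with
# the "between 1M and 2M steps" reported in the paper.
print(training_steps(132e9), training_steps(264e9))
```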