How Does Code Pretraining Affect Language Model Task Performance?

Authors: Jackson Petty, Sjoerd van Steenkiste, Tal Linzen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we do just this. We pretrain language models on datasets which interleave natural language and code in two different settings: competitive, in which the total volume of data seen during pretraining is held constant; and additive, in which the volume of language data is held constant. We study how the pretraining mixture affects performance on (a) compositionality, measured by generalization accuracy on semantic parsing and syntactic transformation tasks, and more broadly on (b) downstream non-code-related objectives, measured by performance on tasks from the BIG-bench benchmark. We find that pretraining on higher proportions of code improves performance on compositional tasks involving structured output (like semantic parsing), and mathematics. Conversely, increased code mixture can harm performance on other tasks, including on tasks that require sensitivity to linguistic structure such as syntax or morphology, and tasks measuring real-world knowledge.
Researcher Affiliation | Collaboration | Jackson Petty (EMAIL), Department of Linguistics, New York University; Sjoerd van Steenkiste (EMAIL), Google Research; Tal Linzen (EMAIL), Google Research
Pseudocode | No | The paper describes its methodology and experimental setup in detail but does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described in narrative text.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide any links to a code repository for the methodology described. It refers to open-source models like Llama 2 & Code Llama and Gemma & Code Gemma, but these are external models used for comparison, not the authors' own implementation.
Open Datasets | Yes | The ingredients for our datasets are the English portion of the Colossal Cleaned Common Crawl (C4; Raffel et al., 2020) and a version of the code portion of The Pile (Gao et al., 2020), itself taken from public GitHub repositories; we use a version which has been cleaned to include only non-binary files smaller than 1MB with common code-related file extensions. To evaluate whether increased code mixture enables compositional generalization, we finetune our pretrained models on a suite of compositional generalization datasets: COGS (Kim & Linzen, 2020), a semantic parsing task in which natural-language sentences are transformed into a formal semantic representation; COGS-vf (Qiu et al., 2022), a variant of COGS which simplifies the output format; and English Passivization (Mueller et al., 2022), a natural-language transduction task in which synthetically-generated active-voice sentences are transformed into passive variants. We also evaluate models on BIG-bench (Srivastava et al., 2023), a benchmark of 204 diverse and challenging tasks presented in a common format.
Dataset Splits | Yes | Each dataset contains training, validation, and generalization splits, where the generalization split is constructed to test licit-but-unattested combinations of familiar primitives. COGS and COGS-vf both divide their generalization split into two parts based on generalization type: either lexical... or structural...
Hardware Specification | Yes | We pretrain models on TPUs. We estimate that full replication of the pretraining procedure outlined here would take roughly 750 TPU-days of compute.
Software Dependencies | No | We construct 12-layer decoder-only models in t5x (Roberts et al., 2023). Model hyperparameters were chosen following the methodology of Wang et al. (2022) and Petty et al. (2024) to approximate decoder-only versions of T5-large...
Experiment Setup | Yes | We construct 12-layer decoder-only models in t5x (Roberts et al., 2023). Model hyperparameters were chosen following the methodology of Wang et al. (2022) and Petty et al. (2024) to approximate decoder-only versions of T5-large, resulting in models with roughly 374M parameters; see Appendix A for hyperparameter details. We pretrain these models with a base natural language data volume of 132B tokens. This means that all models in the competitive setting were trained with Ntotal = 132B tokens, while the models in the additive setting were trained with Nlang = 132B tokens, and hence Ntotal varying between 132B tokens and 264B tokens depending on the mixture; we use a batch size of 128, meaning that models were trained for between 1M and 2M steps, depending on the mixture and setting. For each combination of code mixture and setting, we pretrain models from five different random seeds. In Appendix A: We use the baseline 374M-parameter model configuration from Petty et al. (2024) for our experiments, which has n_layers = 24, d_ff = 2816, d_model = d_attention = 1024, and n_heads = 64. For all compositional generalization datasets, we finetune models for 10K steps.
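The competitive and additive settings quoted above amount to simple token-budget arithmetic, which can be sketched as follows. This is an illustration only: the function name, the return structure, and the choice to parameterize the mixture as a fraction of total tokens are assumptions, not from the paper.

```python
def mixture_tokens(code_frac, n_lang_base=132e9, setting="competitive"):
    """Token budgets for a given code-mixture fraction (illustrative sketch).

    competitive: total volume is fixed at the base budget, so code displaces
                 natural-language data.
    additive:    language volume is fixed at the base budget, so code is added
                 on top and the total grows.
    """
    if setting == "competitive":
        total = n_lang_base
        lang = (1 - code_frac) * total
    elif setting == "additive":
        lang = n_lang_base
        total = lang / (1 - code_frac)
    else:
        raise ValueError(f"unknown setting: {setting}")
    code = total - lang
    return {"lang": lang, "code": code, "total": total}

# At a 50% code mixture, the additive total doubles the 132B-token base
# budget to 264B tokens, matching the range reported in the paper.
print(mixture_tokens(0.5, setting="additive")["total"])
```

Under this parameterization, the paper's reported range of Ntotal (132B to 264B tokens in the additive setting) corresponds to code fractions between 0 and 0.5.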
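The reported step counts can be sanity-checked from the token budgets and the batch size of 128. The quoted excerpt does not state the sequence length, so the value of 1024 tokens below is an assumption chosen because it makes the reported numbers line up.

```python
BATCH_SIZE = 128
SEQ_LEN = 1024  # assumption: not stated in the quoted excerpt

def training_steps(n_tokens):
    """Steps needed to consume n_tokens at BATCH_SIZE sequences per step."""
    return n_tokens / (BATCH_SIZE * SEQ_LEN)

# 132B tokens (competitive / additive base) comes out to roughly 1M steps,
# and 264B tokens (additive, 50% code) to roughly 2M steps, consistent with
# the "between 1M and 2M steps" reported in the paper.
print(training_steps(132e9), training_steps(264e9))
```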