Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

Authors: Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwaraknath Gnaneshwar, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions." |
| Researcher Affiliation | Collaboration | Laura Ruis (AI Centre, UCL); Maximilian Mozes (Cohere); Juhan Bae (University of Toronto & Vector Institute); Siddhartha Rao Kamalakara (Cohere); Dwarak Talupuru (Cohere); Acyr Locatelli (Cohere); Robert Kirk (AI Centre, UCL); Tim Rocktäschel (AI Centre, UCL); Edward Grefenstette (AI Centre, UCL); Max Bartolo (Cohere) |
| Pseudocode | No | The paper describes its methodology in narrative text and shows examples of code found in influential pretraining documents (e.g., a JavaScript function for calculating slope), but does not present structured pseudocode or algorithm blocks for its own methods. |
| Open Source Code | No | "The code we use for EK-FAC influence functions at scale is a part of larger internal infrastructure, and hence cannot be released publicly." Although the work is based on proprietary models and pretraining data, the authors make some efforts toward reproducibility: for one of the models used (the 35B model), the final-stage model (further trained after SFT) is publicly available on Hugging Face. |
| Open Datasets | Yes | "To compare results of using EK-FAC influence functions with different approximations, we use the same fine-tuned model from Section A.1 to calculate influence scores for the 4656 training examples (i.e. documents) on the first 32 validation examples (i.e. queries) of the Wikitext-2 (Merity et al., 2016). ... DROP (Dua et al., 2019) and RACE (Lai et al., 2017)." |
| Dataset Splits | Yes | "We take GPT-2 small (124M) from Hugging Face, and fine-tune it for three epochs with next-word prediction on Wikitext-2 (Merity et al., 2016). ... evaluate it on 50 validation examples with a metric (perplexity or accuracy). ... We randomly select a subset of 8000 examples for fine-tuning, and use the procedure described above to perform counterfactual experiments. ... We apply the exact same procedure to the RACE dataset, except now we keep 10k examples (empirically found to lead to the least overfitting when fine-tuning)." |
| Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used to run its experiments. |
| Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2015) and references the Dask documentation (Dask Development Team, 2016), but does not provide version numbers for these or any other key software dependencies required for replication. |
| Experiment Setup | Yes | "We take GPT-2 small (124M) from Hugging Face, and fine-tune it for three epochs with next-word prediction on Wikitext-2 (Merity et al., 2016). We use Adam optimizer (Kingma & Ba, 2015) with default parameters (β1 = 0.9, β2 = 0.999, ε = 1e-8, additive weight decay 0.01). ... We use Adam optimizer again, with the same hyperparameters as for the above experiment: β1 = 0.9, β2 = 0.999, ε = 1e-8, additive weight decay 0.01, but only train for one epoch." |
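The optimizer configuration quoted in the experiment setup (Adam with β1 = 0.9, β2 = 0.999, ε = 1e-8, and additive weight decay 0.01) corresponds to AdamW-style updates with decoupled weight decay. A minimal sketch of a single such update for one scalar parameter, with defaults matching the quoted values; the function name and the first-step example are illustrative, not from the paper:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    Hyperparameter defaults mirror the values quoted in the paper's
    setup: beta1 = 0.9, beta2 = 0.999, eps = 1e-8, additive weight
    decay 0.01.
    """
    # Update biased first- and second-moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-correct the moment estimates (t is the 1-indexed step count).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adam step plus decoupled ("additive") weight decay on theta itself.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, m, v

# First update step from theta = 1.0 with gradient 0.5.
theta, m, v = adamw_step(1.0, 0.5, m=0.0, v=0.0, t=1)
```

The "additive" decay acts directly on the parameter rather than being folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularization.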
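For context on what the influence scores in the dataset rows measure: an influence function estimates how a query's loss would change if a training document were up-weighted, roughly -∇L(query)ᵀ H⁻¹ ∇L(train). The paper approximates the Hessian with EK-FAC inside proprietary infrastructure that cannot be reproduced here; the toy below substitutes a damped diagonal Fisher approximation, and all gradients and shapes are made-up illustrations, not the paper's data or method:

```python
def influence_score(query_grad, train_grad, fisher_diag, damping=1e-3):
    """Toy influence score: -g_query^T F^{-1} g_train with a diagonal
    Fisher approximation. The paper uses EK-FAC, a richer
    Kronecker-factored approximation; this is illustrative only."""
    return -sum(gq * gt / (f + damping)
                for gq, gt, f in zip(query_grad, train_grad, fisher_diag))

# Hypothetical per-example gradients for one query and two training docs.
query_grad = [0.2, -0.1, 0.4]
train_grads = [[0.1, -0.2, 0.3], [-0.3, 0.1, -0.2]]
# Crude diagonal Fisher estimate: mean squared training gradient per dim.
fisher_diag = [sum(g[i] ** 2 for g in train_grads) / len(train_grads)
               for i in range(3)]
# Score each training document against the query; under this sign
# convention, more negative means the document lowers the query loss.
scores = [influence_score(query_grad, g, fisher_diag) for g in train_grads]
```

Ranking all pretraining documents by such scores for each query is what lets the paper contrast the documents driving reasoning answers with those driving factual answers.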