Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

Authors: Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwaraknath Gnaneshwar, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions." |
| Researcher Affiliation | Collaboration | Laura Ruis (AI Centre, UCL); Maximilian Mozes (Cohere); Juhan Bae (University of Toronto & Vector Institute); Siddhartha Rao Kamalakara (Cohere); Dwarak Talupuru (Cohere); Acyr Locatelli (Cohere); Robert Kirk (AI Centre, UCL); Tim Rocktäschel (AI Centre, UCL); Edward Grefenstette (AI Centre, UCL); Max Bartolo (Cohere) |
| Pseudocode | No | The paper describes its methodology in narrative text and shows examples of code found in influential pretraining documents (e.g., a JavaScript function for calculating slope), but does not present structured pseudocode or algorithm blocks for its own methods. |
| Open Source Code | No | "The code we use for EK-FAC influence functions at scale is a part of larger internal infrastructure, and hence cannot be released publicly." Although the work is based on proprietary models and pretraining data, the authors make some efforts toward reproducibility: for one of the models used (the 35B model), the final-stage model (further trained after SFT) is publicly available on Hugging Face. |
| Open Datasets | Yes | "To compare results of using EK-FAC influence functions with different approximations, we use the same fine-tuned model from Section A.1 to calculate influence scores for the 4656 training examples (i.e. documents) on the first 32 validation examples (i.e. queries) of the Wikitext-2 (Merity et al., 2016). ... DROP (Dua et al., 2019) and RACE (Lai et al., 2017)." |
| Dataset Splits | Yes | "We take GPT-2 small (124M) from Hugging Face, and fine-tune it for three epochs with next-word prediction on Wikitext-2 (Merity et al., 2016). ... evaluate it on 50 validation examples with a metric (perplexity or accuracy). ... We randomly select a subset of 8000 examples for fine-tuning, and use the procedure described above to perform counterfactual experiments. ... We apply the exact same procedure to the RACE dataset, except now we keep 10k examples (empirically found to lead to the least overfitting when fine-tuning)." |
| Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used to run its experiments. |
| Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2015) and references the Dask documentation (Dask Development Team, 2016), but does not provide version numbers for these or any other key software dependencies required for replication. |
| Experiment Setup | Yes | "We take GPT-2 small (124M) from Hugging Face, and fine-tune it for three epochs with next-word prediction on Wikitext-2 (Merity et al., 2016). We use Adam optimizer (Kingma & Ba, 2015) with default parameters (β1 = 0.9, β2 = 0.999, ε = 1e-8, additive weight decay 0.01). ... We use Adam optimizer again, with the same hyperparameters as for the above experiment: β1 = 0.9, β2 = 0.999, ε = 1e-8, additive weight decay 0.01, but only train for one epoch." |
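The optimizer configuration quoted in the experiment setup (Adam with β1 = 0.9, β2 = 0.999, ε = 1e-8, and additive weight decay 0.01) corresponds to AdamW-style updates with decoupled weight decay. A minimal sketch of a single such update for one scalar parameter, with defaults matching the quoted values; the function name and the first-step example are illustrative, not from the paper:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    Hyperparameter defaults mirror the values quoted in the paper's
    setup: beta1 = 0.9, beta2 = 0.999, eps = 1e-8, additive weight
    decay 0.01.
    """
    # Update biased first- and second-moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-correct the moment estimates (t is the 1-indexed step count).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adam step plus decoupled ("additive") weight decay on theta itself.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, m, v

# First update step from theta = 1.0 with gradient 0.5.
theta, m, v = adamw_step(1.0, 0.5, m=0.0, v=0.0, t=1)
```

The "additive" decay acts directly on the parameter rather than being folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularization.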
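For context on what the influence scores in the dataset rows measure: an influence function estimates how a query's loss would change if a training document were up-weighted, roughly -∇L(query)ᵀ H⁻¹ ∇L(train). The paper approximates the Hessian with EK-FAC inside proprietary infrastructure that cannot be reproduced here; the toy below substitutes a damped diagonal Fisher approximation, and all gradients and shapes are made-up illustrations, not the paper's data or method:

```python
def influence_score(query_grad, train_grad, fisher_diag, damping=1e-3):
    """Toy influence score: -g_query^T F^{-1} g_train with a diagonal
    Fisher approximation. The paper uses EK-FAC, a richer
    Kronecker-factored approximation; this is illustrative only."""
    return -sum(gq * gt / (f + damping)
                for gq, gt, f in zip(query_grad, train_grad, fisher_diag))

# Hypothetical per-example gradients for one query and two training docs.
query_grad = [0.2, -0.1, 0.4]
train_grads = [[0.1, -0.2, 0.3], [-0.3, 0.1, -0.2]]
# Crude diagonal Fisher estimate: mean squared training gradient per dim.
fisher_diag = [sum(g[i] ** 2 for g in train_grads) / len(train_grads)
               for i in range(3)]
# Score each training document against the query; under this sign
# convention, more negative means the document lowers the query loss.
scores = [influence_score(query_grad, g, fisher_diag) for g in train_grads]
```

Ranking all pretraining documents by such scores for each query is what lets the paper contrast the documents driving reasoning answers with those driving factual answers.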