Scalable Influence and Fact Tracing for Large Language Model Pretraining
Authors: Tyler Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, Ian Tenney
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In quantitative evaluations on a fact tracing task, our method performs best at identifying examples that influence model predictions, but classical, model-agnostic retrieval methods such as BM25 still perform better at finding passages which explicitly contain relevant facts. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind, ²UC San Diego |
| Pseudocode | No | The paper describes the methods in Section 3 and subsections using mathematical equations and descriptive text, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our prompt set and model outputs, along with a web-based visualization tool to explore influential examples for factual predictions, commonsense reasoning, arithmetic, and open-ended generation for an 8B-parameter LLM (https://github.com/pair-code/pretraining-tda). |
| Open Datasets | Yes | We pretrain a decoder-only language model on two epochs of English C4 (Raffel et al., 2020) for three model sizes: 154M, 1B, and 8B parameters, using the architecture described in Chowdhery et al. (2023). For factual recall prompts, we use the filtered T-REx dataset from KILT (Petroni et al., 2021), which consists of entity-relation-entity triples such as (Carleton College, country, USA). |
| Dataset Splits | Yes | For all reported experiments, we use the same subsample of 5.4K facts balanced for fact frequency. Specifically, we separate facts into six frequency buckets: 1 to 10, 10 to 10², 10² to 10³, 10³ to 10⁴, 10⁴ to 10⁵, and 10⁵ to 10⁶ occurrences in C4, with frequency annotations described above. We randomly sample up to 1K facts from each frequency bucket. Per bucket, we restrict facts with a given relation and target entity (e.g. country, USA) to 25 examples, and we restrict each target and relation overall to 100 examples. |
| Hardware Specification | No | The paper mentions training models of various sizes (154M, 1B, and 8B parameters) and using the T5X framework, but it does not specify the particular GPU, CPU, or TPU models used for the experiments. |
| Software Dependencies | No | All of our models are implemented using T5X (Roberts et al., 2022). For all model sizes, we use the same SentencePiece tokenizer (Kudo & Richardson, 2018) trained on C4 data with vocabulary size 32K. This text mentions specific software but does not include version numbers for T5X or SentencePiece. |
| Experiment Setup | Yes | We pretrain with batch size 1024 and sequence length 2048 for two epochs (187K steps). The 154M, 1B, and 8B models reach eval losses (log-perplexities) of 2.34, 1.99, and 1.77 respectively. Specific hyperparameters are in Table A.4. Table A.4 further details parameters like Layers, Embedding size, Hidden size, MLP hidden size, Attention heads, Attention head size, Optimizer (Adafactor), Learning rate (0.01), Vocabulary size (32K), Batch size (1024), Sequence length (2048), Activation function (SwiGLU), Attention type (Multi-query), Position embedding (RoPE), Learning rate decay (Inverse square root), Warmup steps (10K), and Dropout (0.0). |
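The frequency-bucketed subsampling described in the Dataset Splits row can be sketched as follows. This is a minimal illustration of the stated procedure, not the authors' released code; the `facts` record layout and all function names are assumptions.

```python
import random
from collections import defaultdict

# Log-spaced frequency buckets from the paper: [1, 10), [10, 10^2), ..., [10^5, 10^6).
BUCKET_EDGES = [1, 10, 10**2, 10**3, 10**4, 10**5, 10**6]


def bucket_index(freq):
    """Return the index of the frequency bucket containing freq, or None."""
    for i in range(len(BUCKET_EDGES) - 1):
        if BUCKET_EDGES[i] <= freq < BUCKET_EDGES[i + 1]:
            return i
    return None


def subsample_facts(facts, per_bucket=1000, per_relation_target=25,
                    overall_relation_target=100, seed=0):
    """Sample up to per_bucket facts per frequency bucket, capping facts that
    share a (relation, target) pair at per_relation_target within a bucket
    and overall_relation_target across all buckets.

    Each fact is assumed to be a dict with keys: relation, target, freq.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for fact in facts:
        i = bucket_index(fact["freq"])
        if i is not None:
            buckets[i].append(fact)

    overall_counts = defaultdict(int)
    sampled = []
    for i in sorted(buckets):
        rng.shuffle(buckets[i])
        bucket_counts = defaultdict(int)
        taken = 0
        for fact in buckets[i]:
            if taken >= per_bucket:
                break
            key = (fact["relation"], fact["target"])
            if bucket_counts[key] >= per_relation_target:
                continue
            if overall_counts[key] >= overall_relation_target:
                continue
            bucket_counts[key] += 1
            overall_counts[key] += 1
            sampled.append(fact)
            taken += 1
    return sampled
```

With these caps, 40 facts sharing (country, USA) in one bucket would be cut to 25, matching the per-bucket restriction described in the row above.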
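The Experiment Setup row can be summarized as a single configuration mapping. This is only a restatement of the values reported from Table A.4; the key names are illustrative and are not T5X/gin identifiers.

```python
# Pretraining hyperparameters as reported in the Experiment Setup row
# (Table A.4 of the paper); key names are assumptions for illustration.
PRETRAIN_CONFIG = {
    "optimizer": "Adafactor",
    "learning_rate": 0.01,
    "lr_decay": "inverse_sqrt",
    "warmup_steps": 10_000,
    "vocab_size": 32_000,
    "batch_size": 1024,
    "sequence_length": 2048,
    "activation": "SwiGLU",
    "attention": "multi-query",
    "position_embedding": "RoPE",
    "dropout": 0.0,
    "epochs": 2,
    "train_steps": 187_000,
}

# Tokens consumed per optimizer step, from the reported batch and sequence sizes.
tokens_per_step = PRETRAIN_CONFIG["batch_size"] * PRETRAIN_CONFIG["sequence_length"]
```

Multiplying `tokens_per_step` by `train_steps` gives the total token budget implied by the reported two-epoch schedule.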