Scalable Influence and Fact Tracing for Large Language Model Pretraining
Authors: Tyler Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, Ian Tenney
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In quantitative evaluations on a fact tracing task, our method performs best at identifying examples that influence model predictions, but classical, model-agnostic retrieval methods such as BM25 still perform better at finding passages which explicitly contain relevant facts. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind, ²UC San Diego |
| Pseudocode | No | The paper describes the methods in Section 3 and subsections using mathematical equations and descriptive text, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our prompt set and model outputs, along with a web-based visualization tool to explore influential examples for factual predictions, commonsense reasoning, arithmetic, and open-ended generation for an 8B-parameter LLM (https://github.com/pair-code/pretraining-tda). |
| Open Datasets | Yes | We pretrain a decoder-only language model on two epochs of English C4 (Raffel et al., 2020) for three model sizes: 154M, 1B, and 8B parameters, using the architecture described in Chowdhery et al. (2023). For factual recall prompts, we use the filtered T-REx dataset from KILT (Petroni et al., 2021), which consists of entity-relation-entity triples such as (Carleton College, country, USA). |
| Dataset Splits | Yes | For all reported experiments, we use the same subsample of 5.4K facts balanced for fact frequency. Specifically, we separate facts into six frequency buckets: 1 to 10, 10 to 10², 10² to 10³, 10³ to 10⁴, 10⁴ to 10⁵, and 10⁵ to 10⁶ occurrences in C4, with frequency annotations described above. We randomly sample up to 1K facts from each frequency bucket. Per bucket, we restrict facts with a given relation and target entity (e.g. country, USA) to 25 examples, and we restrict each target and relation overall to 100 examples. |
| Hardware Specification | No | The paper mentions training models of various sizes (154M, 1B, and 8B parameters) and using the T5X framework, but it does not specify the particular GPU, CPU, or TPU models used for the experiments. |
| Software Dependencies | No | All of our models are implemented using T5X (Roberts et al., 2022). For all model sizes, we use the same SentencePiece tokenizer (Kudo & Richardson, 2018) trained on C4 data with vocabulary size 32K. This text mentions specific software but does not include version numbers for T5X or SentencePiece. |
| Experiment Setup | Yes | We pretrain with batch size 1024 and sequence length 2048 for two epochs (187K steps). The 154M, 1B, and 8B models reach eval losses (log-perplexities) of 2.34, 1.99, and 1.77 respectively. Specific hyperparameters are in Table A.4. Table A.4 further details parameters like Layers, Embedding size, Hidden size, MLP hidden size, Attention heads, Attention head size, Optimizer (Adafactor), Learning rate (0.01), Vocabulary size (32K), Batch size (1024), Sequence length (2048), Activation function (SwiGLU), Attention type (Multi-query), Position embedding (RoPE), Learning rate decay (Inverse square root), Warmup steps (10K), and Dropout (0.0). |
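The frequency-bucketed subsampling described in the Dataset Splits row can be sketched as follows. This is a minimal illustration of the stated procedure, not the authors' released code; the `facts` record layout and all function names are assumptions.

```python
import random
from collections import defaultdict

# Log-spaced frequency buckets from the paper: [1, 10), [10, 10^2), ..., [10^5, 10^6).
BUCKET_EDGES = [1, 10, 10**2, 10**3, 10**4, 10**5, 10**6]


def bucket_index(freq):
    """Return the index of the frequency bucket containing freq, or None."""
    for i in range(len(BUCKET_EDGES) - 1):
        if BUCKET_EDGES[i] <= freq < BUCKET_EDGES[i + 1]:
            return i
    return None


def subsample_facts(facts, per_bucket=1000, per_relation_target=25,
                    overall_relation_target=100, seed=0):
    """Sample up to per_bucket facts per frequency bucket, capping facts that
    share a (relation, target) pair at per_relation_target within a bucket
    and overall_relation_target across all buckets.

    Each fact is assumed to be a dict with keys: relation, target, freq.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for fact in facts:
        i = bucket_index(fact["freq"])
        if i is not None:
            buckets[i].append(fact)

    overall_counts = defaultdict(int)
    sampled = []
    for i in sorted(buckets):
        rng.shuffle(buckets[i])
        bucket_counts = defaultdict(int)
        taken = 0
        for fact in buckets[i]:
            if taken >= per_bucket:
                break
            key = (fact["relation"], fact["target"])
            if bucket_counts[key] >= per_relation_target:
                continue
            if overall_counts[key] >= overall_relation_target:
                continue
            bucket_counts[key] += 1
            overall_counts[key] += 1
            sampled.append(fact)
            taken += 1
    return sampled
```

With these caps, 40 facts sharing (country, USA) in one bucket would be cut to 25, matching the per-bucket restriction described in the row above.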
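The Experiment Setup row can be summarized as a single configuration mapping. This is only a restatement of the values reported from Table A.4; the key names are illustrative and are not T5X/gin identifiers.

```python
# Pretraining hyperparameters as reported in the Experiment Setup row
# (Table A.4 of the paper); key names are assumptions for illustration.
PRETRAIN_CONFIG = {
    "optimizer": "Adafactor",
    "learning_rate": 0.01,
    "lr_decay": "inverse_sqrt",
    "warmup_steps": 10_000,
    "vocab_size": 32_000,
    "batch_size": 1024,
    "sequence_length": 2048,
    "activation": "SwiGLU",
    "attention": "multi-query",
    "position_embedding": "RoPE",
    "dropout": 0.0,
    "epochs": 2,
    "train_steps": 187_000,
}

# Tokens consumed per optimizer step, from the reported batch and sequence sizes.
tokens_per_step = PRETRAIN_CONFIG["batch_size"] * PRETRAIN_CONFIG["sequence_length"]
```

Multiplying `tokens_per_step` by `train_steps` gives the total token budget implied by the reported two-epoch schedule.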