Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
Authors: Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using the Pythia models trained on the Pile dataset, we evaluate four distinct tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning. Our findings reveal varying levels of memorization... |
| Researcher Affiliation | Collaboration | 1University of California, Santa Barbara, 2Allen Institute for AI, 3University of Washington, 4Synth Labs, 5Carnegie Mellon University |
| Pseudocode | No | The paper describes methods such as defining distributional memorization, task-gram language models, and how to search for n-gram co-occurrences, but it does not present these methods within a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at: https://github.com/a-antoniades/llm-corpus-search |
| Open Datasets | Yes | Our experiments focus on the Pythia model family (Biderman et al., 2023), pretrained on the Pile dataset (Gao et al., 2020), and evaluate performance across four tasks: translation (WMT (Callison-Burch et al., 2009)), factual question answering (TriviaQA (Joshi et al., 2017)), world knowledge questions (MMLU (Hendrycks et al., 2020)), and math reasoning (GSM8K (Cobbe et al., 2021)). |
| Dataset Splits | Yes | For translation, we use the WMT-09 dataset (Callison-Burch et al., 2009) with a 2.5K testing set. For factual question answering, we use the TriviaQA dataset (Joshi et al., 2017) with a 10K testing set... For MMLU, since the training set is very small (100–500 examples for each task), we mine the n-gram pairs from the test sets directly. For GSM8K, due to time limitations during the rebuttal period, we also mine the n-gram pairs from the 1K test sets directly. |
| Hardware Specification | Yes | We perform our experiments on workstations with 8 A100 GPUs (40 GB each). |
| Software Dependencies | No | The paper mentions several tools and models like Pythia, OLMo, WIMBD, GPT4o, LASER embeddings, and E5 embeddings but does not provide specific version numbers for these software dependencies or the programming language used for implementation. |
| Experiment Setup | Yes | Our experiments focus on the Pythia model family (Biderman et al., 2023), with a wide range of model sizes ranging from 13M to 12B parameters. All Pythia models are trained on Pile (Gao et al., 2020)... In WMT, the cosine similarity thresholds we use are: 0.85, 0.8, 0.75, and 0.7 for 2 to 5-gram pairs, respectively; For Trivia QA and MMLU, the values are 0.75 and 0.65 for 3 and 5-gram pairs, respectively. We use lower thresholds for larger n-grams because larger n-grams inherently impose stricter alignment, and are therefore less likely. |
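The experiment-setup row describes mining candidate n-gram pairs and keeping only those whose embedding cosine similarity clears a size-dependent threshold (e.g. 0.85 for 2-grams down to 0.70 for 5-grams on WMT). A minimal sketch of that filtering step is below; the `embed` callable and the toy vectors are stand-ins (the paper uses LASER and E5 embeddings), and `filter_ngram_pairs` is a hypothetical helper name, not the authors' code.

```python
# Hedged sketch: keep an (input n-gram, output n-gram) pair only if the
# cosine similarity of their embeddings meets the threshold for that
# n-gram size. Embeddings are caller-supplied; here they are toy vectors.
import numpy as np

# Thresholds quoted from the paper's WMT setup (2- to 5-grams).
WMT_THRESHOLDS = {2: 0.85, 3: 0.80, 4: 0.75, 5: 0.70}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_ngram_pairs(pairs, embed, thresholds):
    """Filter (src, tgt) n-gram pairs by embedding similarity.

    pairs:      iterable of (src_ngram, tgt_ngram) tuples of tokens
    embed:      callable mapping an n-gram tuple to a vector
    thresholds: dict mapping n-gram length -> minimum cosine similarity
    """
    kept = []
    for src, tgt in pairs:
        threshold = thresholds[len(src)]
        if cosine(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

For example, a well-aligned translation pair like ("the", "cat") / ("le", "chat") would survive a 2-gram threshold of 0.85 if its embeddings are near-identical, while an unrelated pair with near-orthogonal embeddings would be dropped.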