Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
Authors: Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using the Pythia models trained on the Pile dataset, we evaluate four distinct tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning. Our findings reveal varying levels of memorization... |
| Researcher Affiliation | Collaboration | 1University of California, Santa Barbara, 2Allen Institute for AI, 3University of Washington, 4Synth Labs, 5Carnegie Mellon University |
| Pseudocode | No | The paper describes methods such as defining distributional memorization, task-gram language models, and how to search for n-gram co-occurrences, but it does not present these methods within a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at: https://github.com/a-antoniades/llm-corpus-search |
| Open Datasets | Yes | Our experiments focus on the Pythia model family (Biderman et al., 2023), pretrained on the Pile dataset (Gao et al., 2020), and evaluate performance across four tasks: translation (WMT (Callison-Burch et al., 2009)), factual question answering (TriviaQA (Joshi et al., 2017)), world knowledge questions (MMLU (Hendrycks et al., 2020)), and math reasoning (GSM8K (Cobbe et al., 2021)). |
| Dataset Splits | Yes | For translation, we use the WMT-09 dataset (Callison-Burch et al., 2009) with a 2.5K testing set. For factual question answering, we use the TriviaQA dataset (Joshi et al., 2017) with a 10K testing set... For MMLU, since the training set is very small (100–500 examples for each task), we mine the n-gram pairs from the test sets directly. For GSM8K, due to time limitations during the rebuttal period, we also mine the n-gram pairs from the 1K test sets directly. |
| Hardware Specification | Yes | We perform our experiments on workstations with 8 A100 GPUs (40 GB each). |
| Software Dependencies | No | The paper mentions several tools and models like Pythia, OLMo, WIMBD, GPT4o, LASER embeddings, and E5 embeddings but does not provide specific version numbers for these software dependencies or the programming language used for implementation. |
| Experiment Setup | Yes | Our experiments focus on the Pythia model family (Biderman et al., 2023), with a wide range of model sizes ranging from 13M to 12B parameters. All Pythia models are trained on Pile (Gao et al., 2020)... In WMT, the cosine similarity thresholds we use are: 0.85, 0.8, 0.75, and 0.7 for 2 to 5-gram pairs, respectively; For Trivia QA and MMLU, the values are 0.75 and 0.65 for 3 and 5-gram pairs, respectively. We use lower thresholds for larger n-grams because larger n-grams inherently impose stricter alignment, and are therefore less likely. |
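The experiment-setup row describes mining candidate n-gram pairs and keeping only those whose embedding cosine similarity clears a size-dependent threshold (e.g. 0.85 for 2-grams down to 0.70 for 5-grams on WMT). A minimal sketch of that filtering step is below; the `embed` callable and the toy vectors are stand-ins (the paper uses LASER and E5 embeddings), and `filter_ngram_pairs` is a hypothetical helper name, not the authors' code.

```python
# Hedged sketch: keep an (input n-gram, output n-gram) pair only if the
# cosine similarity of their embeddings meets the threshold for that
# n-gram size. Embeddings are caller-supplied; here they are toy vectors.
import numpy as np

# Thresholds quoted from the paper's WMT setup (2- to 5-grams).
WMT_THRESHOLDS = {2: 0.85, 3: 0.80, 4: 0.75, 5: 0.70}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_ngram_pairs(pairs, embed, thresholds):
    """Filter (src, tgt) n-gram pairs by embedding similarity.

    pairs:      iterable of (src_ngram, tgt_ngram) tuples of tokens
    embed:      callable mapping an n-gram tuple to a vector
    thresholds: dict mapping n-gram length -> minimum cosine similarity
    """
    kept = []
    for src, tgt in pairs:
        threshold = thresholds[len(src)]
        if cosine(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

For example, a well-aligned translation pair like ("the", "cat") / ("le", "chat") would survive a 2-gram threshold of 0.85 if its embeddings are near-identical, while an unrelated pair with near-orthogonal embeddings would be dropped.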