How Much Can We Forget about Data Contamination?
Authors: Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike Von Luxburg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we experimentally quantify the magnitude of benchmark overfitting by scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 repetitions of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. |
| Researcher Affiliation | Collaboration | 1University of Tübingen, Tübingen AI Center, Germany 2Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), Sunnyvale, USA. |
| Pseudocode | No | The paper describes parameter updates using equations (1), (2), (3), and (4) in Section 5.1, but these are mathematical formulations and not presented as a structured pseudocode or algorithm block. |
| Open Source Code | Yes | The code for this paper is available at https://github.com/tml-tuebingen/forgetting-contamination/. |
| Open Datasets | Yes | The training data is the 100BT split of the FineWeb-Edu dataset (Lozhkov et al., 2024). We trained on the 100BT split of the FineWeb-Edu dataset, available at huggingface.co/datasets/HuggingFaceFW/fineweb-edu. |
| Dataset Splits | Yes | A holdout set of 10,000 benchmark questions is never added to the training data. The other subsets are added to the training data, repeated either 4, 12, 36, or 144 times. |
| Hardware Specification | Yes | Model training relied on PyTorch (Paszke et al., 2019) and was performed on 8x A100 nodes for all experiments except the continual pre-training of OLMo-7B, which ran for 6 weeks on 4x H100. |
| Software Dependencies | No | Model training relied on PyTorch (Paszke et al., 2019). The code relies on the OLMo codebase, available at github.com/allenai/OLMo, and the llm.c codebase, available at github.com/karpathy/llm.c. While software names are mentioned, specific version numbers for PyTorch, the OLMo codebase, or the llm.c codebase are not provided within the paper text. |
| Experiment Setup | Yes | We train language models of up to 1.6B parameters using the architecture and hyperparameters from the GPT-3 paper (Brown et al., 2020, Table 2.1). For this, we adopt the llm.c codebase. We consider exact contamination, that is, we contaminate the training data with the same texts that the model is later evaluated on. We insert benchmark questions individually and at random positions into the training data. Models are evaluated zero-shot via the likelihood assigned to different sentence completions (Gao, 2021). We consider the contaminated model from Section 4.2 after two times Chinchilla and continue training with four different choices of the weight decay parameter. |
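The contamination procedure described in the Experiment Setup row (inserting each benchmark question individually, at random positions, repeated 4, 12, 36, or 144 times) can be sketched as below. This is an illustrative reconstruction, not code from the authors' repository; the function name and document-level granularity are assumptions.

```python
import random

def contaminate(training_docs, benchmark_examples, repetitions, seed=0):
    """Insert each benchmark example `repetitions` times at random
    positions in the stream of training documents.

    Sketch of exact contamination: the inserted texts are identical
    to the texts the model is later evaluated on.
    """
    rng = random.Random(seed)
    contaminated = list(training_docs)
    for example in benchmark_examples:
        for _ in range(repetitions):
            # Each copy lands at an independent random position.
            pos = rng.randint(0, len(contaminated))
            contaminated.insert(pos, example)
    return contaminated
```

In the paper's setup, a holdout of 10,000 benchmark questions is kept out of this procedure entirely, so overfitting can be measured as the accuracy gap between contaminated and holdout questions.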
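The zero-shot evaluation via the likelihood assigned to different sentence completions (Gao, 2021) amounts to a multiple-choice argmax over model log-likelihoods. A minimal sketch, assuming a `log_likelihood(question, completion)` callable supplied by the evaluation harness (the length normalization by word count here is one common convention, not necessarily the paper's exact choice):

```python
def pick_completion(log_likelihood, question, completions):
    """Zero-shot multiple-choice evaluation: score each candidate
    completion by the length-normalized log-likelihood the model
    assigns to it, and return the index of the best-scoring one."""
    scores = [
        log_likelihood(question, c) / max(len(c.split()), 1)
        for c in completions
    ]
    return max(range(len(completions)), key=scores.__getitem__)
```

Accuracy is then the fraction of questions where the chosen index matches the gold completion, computed separately for contaminated and holdout questions.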