How Much Can We Forget about Data Contamination?
Authors: Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike Von Luxburg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we experimentally quantify the magnitude of benchmark overfitting by scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 repetitions of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. |
| Researcher Affiliation | Collaboration | 1University of Tübingen, Tübingen AI Center, Germany 2Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), Sunnyvale, USA. |
| Pseudocode | No | The paper describes parameter updates using equations (1), (2), (3), and (4) in Section 5.1, but these are mathematical formulations and not presented as a structured pseudocode or algorithm block. |
| Open Source Code | Yes | The code for this paper is available at https://github.com/tml-tuebingen/forgetting-contamination/. |
| Open Datasets | Yes | The training data is the 100BT split of the FineWeb-Edu dataset (Lozhkov et al., 2024). We trained on the 100BT split of the FineWeb-Edu dataset, available at huggingface.co/datasets/HuggingFaceFW/fineweb-edu. |
| Dataset Splits | Yes | A holdout set of 10,000 benchmark questions is never added to the training data. The other subsets are added to the training data, repeated either 4, 12, 36, or 144 times. |
| Hardware Specification | Yes | Model training relied on PyTorch (Paszke et al., 2019) and was performed on 8x A100 nodes for all experiments except the continual pre-training of OLMo-7B, which ran for 6 weeks on 4x H100. |
| Software Dependencies | No | Model training relied on PyTorch (Paszke et al., 2019). The code relies on the OLMo codebase, available at github.com/allenai/OLMo, and the llm.c codebase, available at github.com/karpathy/llm.c. While software names are mentioned, specific version numbers for PyTorch, the OLMo codebase, or the llm.c codebase are not provided within the paper text. |
| Experiment Setup | Yes | We train language models of up to 1.6B parameters using the architecture and hyperparameters from the GPT-3 paper (Brown et al., 2020, Table 2.1). For this, we adopt the llm.c codebase. We consider exact contamination, that is, we contaminate the training data with the same texts that the model is later evaluated on. We insert benchmark questions individually and at random positions into the training data. Models are evaluated zero-shot via the likelihood assigned to different sentence completions (Gao, 2021). We consider the contaminated model from Section 4.2 after two times Chinchilla and continue training with four different choices of the weight decay parameter. |
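The contamination procedure described in the Experiment Setup row (inserting each benchmark question individually, at random positions, repeated 4, 12, 36, or 144 times) can be sketched as below. This is an illustrative reconstruction, not code from the authors' repository; the function name and document-level granularity are assumptions.

```python
import random

def contaminate(training_docs, benchmark_examples, repetitions, seed=0):
    """Insert each benchmark example `repetitions` times at random
    positions in the stream of training documents.

    Sketch of exact contamination: the inserted texts are identical
    to the texts the model is later evaluated on.
    """
    rng = random.Random(seed)
    contaminated = list(training_docs)
    for example in benchmark_examples:
        for _ in range(repetitions):
            # Each copy lands at an independent random position.
            pos = rng.randint(0, len(contaminated))
            contaminated.insert(pos, example)
    return contaminated
```

In the paper's setup, a holdout of 10,000 benchmark questions is kept out of this procedure entirely, so overfitting can be measured as the accuracy gap between contaminated and holdout questions.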
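The zero-shot evaluation via the likelihood assigned to different sentence completions (Gao, 2021) amounts to a multiple-choice argmax over model log-likelihoods. A minimal sketch, assuming a `log_likelihood(question, completion)` callable supplied by the evaluation harness (the length normalization by word count here is one common convention, not necessarily the paper's exact choice):

```python
def pick_completion(log_likelihood, question, completions):
    """Zero-shot multiple-choice evaluation: score each candidate
    completion by the length-normalized log-likelihood the model
    assigns to it, and return the index of the best-scoring one."""
    scores = [
        log_likelihood(question, c) / max(len(c.split()), 1)
        for c in completions
    ]
    return max(range(len(completions)), key=scores.__getitem__)
```

Accuracy is then the fraction of questions where the chosen index matches the gold completion, computed separately for contaminated and holdout questions.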