Contamination Budget: Trade-offs Between Breadth, Depth and Difficulty
Authors: Behzad Mehrbakhsh, Fernando Martínez-Plumed, José Hernández-Orallo
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we explore when this contamination is performed intentionally, for purposes that can be malicious (e.g., get better scores in evaluations) or benevolent (e.g., fix some mistakes). These interventions, usually in the form of fine-tuning memorisations, come with a budget in the size of the fine-tuning dataset. Several trade-offs appear between the breadth of the intervention (how many examples to be memorised), its depth (how many repetitions of each example) and the difficulty of the examples. By studying several LLMs and datasets, we observe some monotonic behaviour (more difficult items require more depth to be fixed) but also some non-monotonic phenomena (very high depth levels have negative effects on non-contaminated examples). This suggests that trade-offs should be found not only in terms of the budget but also according to model specifics, the task and the item difficulty at hand. Section 4, titled 'Methods', details the 'Tasks', 'Difficulty estimation', 'Training, test and validation subsets', 'Models', 'Contamination budget', and 'Fine-tuning' used in the experiments. RQ5 also asks 'Based on ablation studies, how do model architecture and size affect the success of breadth and depth strategies in LLM contamination interventions?' |
| Researcher Affiliation | Academia | Behzad Mehrbakhsh1,2, Fernando Martínez-Plumed1,2, José Hernández-Orallo1,2 1UPV Universitat Politècnica de València 2VRAIN Valencian Research Institute for Artificial Intelligence EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes mathematical functions and strategies (e.g., Eq. 2 for selection strategy σ) but does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described in narrative text and mathematical notation. |
| Open Source Code | No | The paper states: 'All fine-tuning experiments are performed using the Hugging Face Transformers library3 with PyTorch [Paszke et al., 2019] as the back-end.' Footnote 3 provides a link to the Hugging Face Transformers library (https://github.com/huggingface/transformers), which is a third-party tool used by the authors, not their own source code for the methodology described in the paper. There is no explicit statement about releasing their own code or a link to a repository for their implementation. |
| Open Datasets | Yes | 4.1 Tasks: We select recently published benchmarks to minimise the likelihood of pre-training contamination, and multiple-choice and open-ended tasks whose responses can be validated by exact string matching. MMLU-Pro [Wang et al., 2024], an enhanced version of the MMLU benchmark [Hendrycks et al., 2020], including more challenging, reasoning-oriented questions, increasing the number of answer choices from four to ten, and removing trivial and noisy questions found in MMLU. MedMCQA [Pal et al., 2022], a dataset of medical multiple-choice questions designed to assess clinical knowledge comprehension (e.g., real-world medical entrance exam questions). Addition [Zhou et al., 2024], containing 10,000 addition instances with addends ranging from 3 to 9 digits. Anagram [Zhou et al., 2024], a set of 9,000 anagrams of common English words from the Google Web Trillion Word Corpus with a length ranging from 3 to 7. |
| Dataset Splits | Yes | 4.3 Training, test and validation subsets: All the datasets in 4.1 are divided into several subsets: Test set: Dt ⊆ X, used to evaluate the performance of the models after fine-tuning. Fine-tuning set: Df, including a series of n instances from Dt one or more times according to different contamination scenarios (see 4.5). Non-contaminated set: Du = Dt ∖ Df, which consists of the instances in the test set that are not fine-tuned. Validation set: Dv ⊆ X, where Dv ∩ Dt = ∅ (so no contaminated instances), used to monitor performance after fine-tuning. We chose |Dt| = b = 500 items. For Dt we randomly sample 100 instances from each of the five difficulty bins (see 4.2). ... Df is sampled from the test set based on the scenarios in 4.5. A similar methodology was used to construct Dv, which also consists of 100 items per difficulty bin, totalling 500 items. |
| Hardware Specification | No | The paper states: 'All fine-tuning experiments are performed using the Hugging Face Transformers library3 with PyTorch [Paszke et al., 2019] as the back-end.' However, it does not specify any particular hardware components such as GPU models, CPU types, or memory specifications used for these experiments. |
| Software Dependencies | No | The paper mentions using 'Hugging Face Transformers library' and 'PyTorch [Paszke et al., 2019]' as the back-end, and 'LoRA [Yu et al., 2023]' for parameter-efficient fine-tuning. However, it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | 4.6 Fine-tuning: We use the same fine-tuning hyperparameters in all experiments: learning rate (4 × 10⁻⁴), batch size (8), number of epochs (1), optimiser (paged_adamw_8bit [Loshchilov, 2017]). LoRA [Yu et al., 2023] is employed for parameter-efficient fine-tuning, with low-rank adaptation (r = 16), a scaling factor (α = 32), and a dropout rate of 0.05. |
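The breadth/depth trade-off described in the abstract row can be made concrete: with a fixed fine-tuning budget b (number of memorisation examples in the fine-tuning set), spending it on n distinct items means each item is repeated b/n times. The helper below is a minimal sketch of that accounting; the function name and signature are illustrative, not from the paper.

```python
def contamination_plan(budget, breadth):
    """Split a fixed fine-tuning budget between breadth (number of
    distinct items memorised) and depth (repetitions per item).

    Hypothetical helper illustrating the budget arithmetic; assumes
    the budget is spent evenly across the chosen items.
    """
    if budget % breadth != 0:
        raise ValueError("breadth must divide the budget evenly")
    depth = budget // breadth
    return breadth, depth


# A budget of 500 memorisations can be spent many ways, e.g.:
wide = contamination_plan(500, 500)   # 500 items, each seen once
deep = contamination_plan(500, 100)   # 100 items, each repeated 5 times
```

The paper's observation that harder items need more depth, while excessive depth hurts non-contaminated items, means the best (breadth, depth) point depends on the model, task, and item difficulty rather than on the budget alone.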
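The Dt/Df/Du/Dv construction in the Dataset Splits row can be sketched as follows. This is a plain-Python illustration under the assumptions stated in the row (100 items per difficulty bin for both Dt and Dv, Df sampled from Dt, Du the remainder); the function and argument names are hypothetical, not the authors' code.

```python
import random


def build_splits(pool_by_bin, per_bin=100, breadth=100, seed=0):
    """Sketch of the split construction: sample per_bin items per
    difficulty bin for the test set Dt and validation set Dv (kept
    disjoint), then draw the contaminated subset Df from Dt and
    leave Du = Dt \ Df as the non-contaminated remainder.
    """
    rng = random.Random(seed)
    d_t, d_v = [], []
    for bin_items in pool_by_bin.values():
        # Draw 2*per_bin distinct items so Dt and Dv never overlap.
        sample = rng.sample(bin_items, 2 * per_bin)
        d_t.extend(sample[:per_bin])   # test set: per_bin items per bin
        d_v.extend(sample[per_bin:])   # validation set, disjoint from Dt
    d_f = rng.sample(d_t, breadth)     # fine-tuned (contaminated) subset
    d_u = [x for x in d_t if x not in d_f]  # non-contaminated remainder
    return d_t, d_f, d_u, d_v
```

With five bins and the row's values (per_bin=100), this yields |Dt| = |Dv| = 500, matching the budget b = 500 stated in the paper; the depth of contamination is then applied by repeating Df's items during fine-tuning.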
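For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single config. The keys below mirror common Hugging Face Transformers / PEFT argument names but are an illustrative grouping, not a fragment of the authors' code.

```python
# Fine-tuning hyperparameters reported in Section 4.6 of the paper,
# gathered as a plain dict. Key names follow Transformers/PEFT
# conventions for readability; this is a sketch, not the paper's code.
FINETUNE_CONFIG = {
    "learning_rate": 4e-4,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 1,
    "optim": "paged_adamw_8bit",
    # LoRA parameter-efficient fine-tuning settings
    "lora_r": 16,        # low-rank adaptation rank
    "lora_alpha": 32,    # scaling factor
    "lora_dropout": 0.05,
}
```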