Training on the Test Task Confounds Evaluation and Emergence

Authors: Ricardo Dominguez-Olmedo, Florian Eddie Dorner, Moritz Hardt

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices such as training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of practices that utilize knowledge about evaluation tasks at training time. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for the effect of training on the test task on benchmark evaluations. Put simply: fine-tune each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models, with broad implications for benchmarking and the study of emergent capabilities. Our analysis spans 56 different language models and two major active benchmarks, MMLU and GSM8K. We start in Section 2 by dividing models into those trained before November 2023 and those trained after. We find that for the same amount of pretraining compute, newer models strongly outperform older ones, on average by 7 percentage points on MMLU and 19 points on GSM8K. We then fine-tune all models on the same amount of task-specific data before evaluation. After fine-tuning on the same task data, newer models no longer outperform older ones; rather, their performance equalizes. See Figure 1.
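The adjustment described above can be illustrated with a toy sketch. Everything below is invented for illustration (function names, model attributes, and numbers); the paper's actual method fine-tunes real models on shared task-relevant data before benchmarking them.

```python
# Toy sketch: compare models only AFTER fine-tuning each on the same
# task data, so differing prior "training on the test task" no longer
# confounds the comparison. All numbers here are illustrative.

def finetune(model, task_data):
    # Stand-in: fine-tuning saturates the model's exposure to the task.
    tuned = dict(model)
    tuned["task_exposure"] = 1.0
    return tuned

def evaluate(model):
    # Stand-in benchmark score: underlying skill plus a task-exposure bonus.
    return model["skill"] + 10.0 * model["task_exposure"]

def adjusted_comparison(models, task_data):
    """Fine-tune every model on identical task data, then evaluate."""
    return {name: evaluate(finetune(m, task_data)) for name, m in models.items()}

older = {"skill": 50.0, "task_exposure": 0.2}  # trained before Nov 2023
newer = {"skill": 50.0, "task_exposure": 0.9}  # trained after Nov 2023

raw = {name: evaluate(m) for name, m in {"older": older, "newer": newer}.items()}
adjusted = adjusted_comparison({"older": older, "newer": newer}, task_data=None)
# Raw scores differ (52 vs 59); adjusted scores equalize (60 vs 60).
```

The point of the toy: the raw gap reflects task exposure rather than skill, and vanishes once exposure is equalized, mirroring the equalization the paper reports in Figure 1.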
Researcher Affiliation Academia Ricardo Dominguez-Olmedo1, Florian E. Dorner1,2, and Moritz Hardt1. 1Max Planck Institute for Intelligent Systems, Tübingen, and Tübingen AI Center; 2ETH Zürich
Pseudocode No The paper describes methods and analyses results using prose, mathematical equations, and figures. It does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The paper mentions using a third-party tool, "LM Evaluation Harness (EleutherAI, 2024)", and provides its GitHub link in the references. However, there is no explicit statement or link indicating that the authors' own code for the methodology described in this paper is publicly released or provided in supplementary materials.
Open Datasets Yes We choose MMLU (Hendrycks et al., 2020) and GSM8K (Cobbe et al., 2021) as a case study for investigating training on the test task in active benchmarks. For multiple-choice question answering, we use the auxiliary training set accompanying the HF MMLU repository (https://huggingface.co/datasets/cais/mmlu). This training set is not an i.i.d. split of MMLU. Instead, it consists of the training sets from other multiple-choice question-answering benchmarks, comprising approximately 100,000 training examples and 30 million tokens. For mathematical reasoning, we combine MetaMathQA (Yu et al., 2023b) and Orca-Math (Mitra et al., 2024), totalling approximately 600,000 training examples and 200M tokens.
Dataset Splits Yes We choose MMLU (Hendrycks et al., 2020) and GSM8K (Cobbe et al., 2021) as a case study for investigating training on the test task in active benchmarks. For multiple-choice question answering, we use the auxiliary training set accompanying the HF MMLU repository. This training set is not an i.i.d. split of MMLU. Instead, it consists of the training sets from other multiple-choice question-answering benchmarks, comprising approximately 100,000 training examples and 30 million tokens. For mathematical reasoning, we combine MetaMathQA (Yu et al., 2023b) and Orca-Math (Mitra et al., 2024), totalling approximately 600,000 training examples and 200M tokens. We evaluate models using LM Evaluation Harness (EleutherAI, 2024), in identical fashion to the HF leaderboard. We evaluate MMLU and GSM8K 5-shot, ARC 25-shot, and HellaSwag 10-shot.
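For context, the k-shot evaluation quoted above prepends k solved examples to each test question before asking the model to answer. A minimal sketch of that idea (the function name and prompt format are hypothetical; the actual templates come from LM Evaluation Harness):

```python
def build_few_shot_prompt(examples, question, k=5):
    """Prepend k solved (question, answer) examples to the target question.

    Toy Q/A format for illustration; real harness templates differ per task
    (k = 5 for MMLU and GSM8K, 25 for ARC, 10 for HellaSwag in this paper).
    """
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:k])
    return f"{shots}\n\nQ: {question}\nA:"
```

The prompt ends with an open "A:" so the model's continuation is scored as its answer.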
Hardware Specification Yes For models with less than 10B parameters, we fine-tune on a single GPU with BF16 precision. For models between 10B and 30B parameters, we train on a single H100 node using DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) and full precision. For models with more than 30B parameters, we train on two H100 nodes using DeepSpeed ZeRO-3 and full precision. We use an internal cluster of A100 and H100 GPUs.
Software Dependencies No The paper mentions using "LM Evaluation Harness (EleutherAI, 2024)" and "DeepSpeed ZeRO-3" but does not provide specific version numbers for these or any other software libraries (e.g., Python, PyTorch, CUDA) used to reproduce the experiments.
Experiment Setup Yes We fine-tune models for three epochs using standard hyperparameter choices, see Appendix B.2. We use a learning rate of 2×10⁻⁵ for models with fewer than 10B parameters and a learning rate of 2×10⁻⁶ for models with more than 10B parameters. We use a cosine learning rate schedule with linear warm-up for 50 steps and decay to 10% of the peak learning rate. We use AdamW (Loshchilov & Hutter, 2018) as the optimizer, with β1 = 0.9, β2 = 0.95, and ε = 10⁻⁸. We fine-tune with batch size 64. We use a weight decay rate of 0.1 and clip gradients at 1.0. To reduce the computational burden of fine-tuning, we train with context size 600.
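The learning-rate schedule quoted above (linear warm-up for 50 steps, cosine decay to 10% of the peak rate) can be sketched as a standalone function. The function name and the exact warmup interpolation are illustrative assumptions, not the authors' code:

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps=50, final_frac=0.1):
    """Cosine schedule with linear warmup, decaying to final_frac * peak_lr.

    Sketch of the schedule described in the paper: 50 warmup steps and
    decay to 10% of the peak learning rate.
    """
    if step < warmup_steps:
        # Linear warmup from (near) zero up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to final_frac * peak_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
```

With `peak_lr = 2e-5` (the sub-10B setting), the rate reaches the peak at step 50 and ends at `2e-6`, i.e., 10% of peak, matching the quoted decay target.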