Position: Language model developers should report train-test overlap

Authors: Andy K Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics.
Researcher Affiliation | Academia | 1 Stanford University, Stanford, CA, USA. Correspondence to: Andy K Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1 Compute Overlapping N-grams
Open Source Code | No | Algorithm for computing overlapping n-grams and frequencies. Code will be released on GitHub.
Open Datasets | Yes | The Pile: An 800GB dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027.
Dataset Splits | No | The paper does not provide dataset split information for its own analysis. It discusses train-test overlap as a concept and how other models use test sets, but does not detail splits for the data used in its study of reporting practices.
Hardware Specification | No | The paper does not report the hardware (GPU/CPU models, processor types, or memory amounts) used for its own analysis or experiments.
Software Dependencies | No | The paper does not list the ancillary software (library or solver names with version numbers) needed to replicate its analysis.
Experiment Setup | No | The paper describes its methodology for surveying model developers and their reporting practices, but it does not detail an experimental setup with concrete hyperparameter values or training configurations for any model training or similar experiment conducted by the authors.
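The Pseudocode and Open Source Code rows above refer to the paper's Algorithm 1 for computing overlapping n-grams and their frequencies. As a rough illustration of that general technique (not the paper's exact algorithm; the function name and the common `n=13` default are assumptions), such a check might look like:

```python
from collections import Counter

def overlapping_ngrams(train_tokens, test_tokens, n=13):
    """Return the test-set n-grams that also occur in the training
    data, mapped to their frequencies in the training data."""
    # Count every n-gram in the training token stream.
    train_counts = Counter(
        tuple(train_tokens[i:i + n])
        for i in range(len(train_tokens) - n + 1)
    )
    # Collect the distinct n-grams in the test token stream.
    test_ngrams = {
        tuple(test_tokens[i:i + n])
        for i in range(len(test_tokens) - n + 1)
    }
    # Keep only the test n-grams seen during training.
    return {ng: train_counts[ng] for ng in test_ngrams if ng in train_counts}

# Toy example with 3-grams over word tokens:
overlap = overlapping_ngrams("a b c d e f".split(), "c d e x".split(), n=3)
# overlap == {("c", "d", "e"): 1}
```

The returned dictionary supports both sides of a train-test overlap report: its keys give the contaminated test n-grams, and its values give how often each appeared in training.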