Position: Language model developers should report train-test overlap
Authors: Andy K Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics. |
| Researcher Affiliation | Academia | Stanford University, Stanford, CA, USA. Correspondence to: Andy Zhang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Compute Overlapping N-grams |
| Open Source Code | No | Algorithm for computing overlapping n-grams and frequencies. Code will be released on GitHub. |
| Open Datasets | Yes | The Pile: An 800GB dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027. |
| Dataset Splits | No | The paper does not provide specific dataset split information for its own analysis. It discusses train-test overlap as a concept and how other models use test sets, but does not detail splits for the data used in its study on reporting practices. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for its own analysis or experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (library or solver names with version numbers) needed to replicate its analysis. |
| Experiment Setup | No | The paper describes its methodology for surveying model developers and their reporting practices, but it does not detail an 'experimental setup' with concrete hyperparameter values or training configurations in the main text for any model training or similar experiment conducted by the authors. |
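
The paper's Algorithm 1 (Compute Overlapping N-grams) is not reproduced here, and its code is listed as not yet released. As a rough illustration only, the sketch below shows one common way to find test n-grams that also occur in training data and to count how often each appears there. The whitespace tokenizer, the function names `tokenize`, `ngrams`, and `compute_overlapping_ngrams`, and the example value of `n` are hypothetical choices, not details taken from the paper.

```python
from collections import Counter
from typing import Iterable


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer (assumption): lowercase + whitespace split.
    # Real overlap pipelines typically use the model's own tokenizer.
    return text.lower().split()


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # All contiguous n-grams; empty when the text has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def compute_overlapping_ngrams(
    train_docs: Iterable[str], test_docs: list[str], n: int
) -> tuple[list[set[tuple[str, ...]]], Counter]:
    """Hypothetical sketch: for each test instance, return the n-grams that
    also appear in the training data, plus the training-set frequency of
    every overlapping n-gram."""
    # 1. Index the (usually small) test set so training data is scanned once.
    test_ngrams = [set(ngrams(tokenize(doc), n)) for doc in test_docs]
    test_index: set[tuple[str, ...]] = set().union(*test_ngrams)

    # 2. Stream training documents, counting only n-grams seen in the test set.
    train_freq: Counter = Counter()
    for doc in train_docs:
        for gram in ngrams(tokenize(doc), n):
            if gram in test_index:
                train_freq[gram] += 1

    # 3. Attribute overlapping n-grams back to each individual test instance.
    overlaps = [{g for g in grams if g in train_freq} for grams in test_ngrams]
    return overlaps, train_freq


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    test = ["a quick brown fox jumps high", "nothing in common here"]
    overlaps, freq = compute_overlapping_ngrams(train, test, n=3)
    flagged = sum(1 for o in overlaps if o)
    print(f"{flagged}/{len(test)} test instances overlap")  # 1/2 test instances overlap
```

Indexing the test set rather than the training corpus keeps memory proportional to the benchmark size, which matters when the training data is as large as The Pile. A developer could report statistics such as the fraction of test instances with any overlapping n-gram, alongside the chosen `n` and tokenizer.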