Position: Language model developers should report train-test overlap

Authors: Andy K Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics.
Researcher Affiliation | Academia | 1 Stanford University, Stanford, CA, USA. Correspondence to: Andy K Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1 Compute Overlapping N-grams
Open Source Code | No | Algorithm for computing overlapping n-grams and frequencies. Code will be released on GitHub.
Open Datasets | Yes | The Pile: An 800GB dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027.
Dataset Splits | No | The paper does not provide dataset split information for its own analysis. It discusses train-test overlap as a concept and how other models use test sets, but does not detail splits for the data used in its study of reporting practices.
Hardware Specification | No | The paper does not report the hardware (GPU/CPU models, processor types, or memory amounts) used for its own analysis or experiments.
Software Dependencies | No | The paper does not list the ancillary software (library or solver names with version numbers) needed to replicate its analysis.
Experiment Setup | No | The paper describes its methodology for surveying model developers and their reporting practices, but it does not detail an experimental setup with concrete hyperparameter values or training configurations for any model training or similar experiment conducted by the authors.
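The Pseudocode and Open Source Code rows above refer to the paper's Algorithm 1 for computing overlapping n-grams and their frequencies. As a rough illustration of that general technique (not the paper's exact algorithm; the function name and the common `n=13` default are assumptions), such a check might look like:

```python
from collections import Counter

def overlapping_ngrams(train_tokens, test_tokens, n=13):
    """Return the test-set n-grams that also occur in the training
    data, mapped to their frequencies in the training data."""
    # Count every n-gram in the training token stream.
    train_counts = Counter(
        tuple(train_tokens[i:i + n])
        for i in range(len(train_tokens) - n + 1)
    )
    # Collect the distinct n-grams in the test token stream.
    test_ngrams = {
        tuple(test_tokens[i:i + n])
        for i in range(len(test_tokens) - n + 1)
    }
    # Keep only the test n-grams seen during training.
    return {ng: train_counts[ng] for ng in test_ngrams if ng in train_counts}

# Toy example with 3-grams over word tokens:
overlap = overlapping_ngrams("a b c d e f".split(), "c d e x".split(), n=3)
# overlap == {("c", "d", "e"): 1}
```

The returned dictionary supports both sides of a train-test overlap report: its keys give the contaminated test n-grams, and its values give how often each appeared in training.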