A Comprehensive Survey of Contamination Detection Methods in Large Language Models

Authors: Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we survey all recent work on contamination detection with LLMs, analyzing their methodologies and use cases to shed light on the appropriate usage of contamination detection methods. Our work calls the NLP research community's attention to systematically taking into account contamination bias in LLM evaluation. ... We categorize contamination detection into two broad types of use cases vastly differing in the techniques involved, namely open-data and closed-data contamination detection, and review all existing work in each category. Our taxonomy further classifies detection techniques into finer-grained categories, each following their own assumptions and model requirements. Through our detailed classification and review of the types of contamination detection, we not only contribute to the academic and practical understanding of data contamination issues but we also highlight the pressing need for strategies that mitigate these risks.
Researcher Affiliation Collaboration Mathieu Ravaut (Nanyang Technological University; Institute for Infocomm Research (I2R), A*STAR); Bosheng Ding (Nanyang Technological University); ... Caiming Xiong (Salesforce Research); Shafiq Joty (Nanyang Technological University; Salesforce Research)
Pseudocode No The paper is a survey, analyzing and categorizing existing contamination detection methods in LLMs. It describes these methods in detail using prose and tables, but it does not present any pseudocode or algorithm blocks for its own methodology or contributions.
Open Source Code No The paper is a comprehensive survey of contamination detection methods in Large Language Models. It analyzes existing methodologies and does not propose new methods that would require releasing source code.
Open Datasets Yes Duan et al. (2024) introduce the MIMIR dataset, constructed from The Pile training set (Gao et al., 2020), on which popular open-source models such as the Pythia (Biderman et al., 2023) and GPT-Neo model series are pre-trained. MIMIR has become widely used in MIA research with LLMs. OLMo MIA (Kim et al., 2024) introduces a membership inference dataset centered around the open-source OLMo-7B LLM (Groeneveld et al., 2024). ... Dolma Book (Zhang & Wu, 2024) also leverages the publicly available pre-training data from OLMo, and samples non-member books from Project Gutenberg dated after January 1st, 2024.
Dataset Splits No The paper is a survey of existing contamination detection methods and does not present its own experimental work with datasets requiring explicit training/test/validation splits. It discusses dataset splitting strategies (e.g., randomized train-test split, time-based splits) in the context of other research papers, but does not define or use them for its own contributions.
Hardware Specification No The paper provides a comprehensive survey of contamination detection methods in Large Language Models. It focuses on analyzing existing research and does not describe any specific hardware used for conducting experiments or for its own contributions.
Software Dependencies No The paper is a survey of existing contamination detection methods and does not describe specific software dependencies or versions used for any experimental setup or implementation related to its own contributions.
Experiment Setup No The paper is a comprehensive survey of contamination detection methods in Large Language Models. It analyzes and categorizes existing research but does not describe any specific experimental setups, hyperparameters, or training configurations for its own work.
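The membership-inference (MIA) datasets mentioned above, such as MIMIR, are typically paired with score-based detectors. As a minimal illustrative sketch, not a method from the surveyed paper, a simple loss-threshold attack flags a sample as a training-set member when the model's loss on it falls below a threshold calibrated on known non-members; the loss values and the `fpr` calibration target below are hypothetical placeholders, since in practice the scores would come from an LLM's per-token negative log-likelihood:

```python
# Loss-threshold membership inference sketch (hedged illustration only).
# A sample is flagged as a "member" of the training set when its model loss
# is lower than a threshold chosen from known non-member losses.

def calibrate_threshold(nonmember_losses, fpr=0.1):
    """Pick a threshold so roughly `fpr` of known non-members get flagged."""
    s = sorted(nonmember_losses)
    k = max(0, int(fpr * len(s)) - 1)
    return s[k]

def is_member(loss, threshold):
    """Lower loss than the calibrated threshold -> likely seen in training."""
    return loss < threshold

# Hypothetical per-sample losses (nats/token), for illustration only.
nonmember = [3.2, 3.5, 3.1, 3.8, 3.4, 3.6, 3.3, 3.7, 3.9, 3.0]
threshold = calibrate_threshold(nonmember, fpr=0.1)
print(threshold)               # lowest decile of non-member losses -> 3.0
print(is_member(1.9, threshold))  # suspiciously low loss -> True
print(is_member(3.6, threshold))  # typical unseen-text loss -> False
```

Time-based splits such as the post-January-2024 Project Gutenberg sampling in Dolma Book serve exactly this calibration role: documents created after the model's training cutoff provide the known non-member pool.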