Automated Detection of Pre-training Text in Black-box LLMs

Authors: Ruihan Hu, Yu-Ming Shang, Jiankun Peng, Wei Luo, Yazhe Wang, Xi Zhang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive evaluations on three widely used datasets demonstrate that our framework is effective and superior in the black-box setting." Section 5 is titled "Experiments" and describes the experimental setup, datasets, and results.
Researcher Affiliation | Academia | "1 Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications, China; 2 Zhongguancun Laboratory, China. EMAIL, EMAIL, EMAIL." The email domains 'bupt.edu.cn' and 'zgclab.edu.cn' indicate academic or public research affiliations.
Pseudocode | No | The paper describes its methods in text and figures but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Code: github.com/STAIR-BUPT/VeilProbe
Open Datasets | Yes | WikiMIA [Shi et al., 2024] consists of Wikipedia event snippets. BookTection [Duarte et al., 2024] is a widely adopted dataset containing 165 copyrighted books, expanded from BookMIA [Shi et al., 2024]. arXivTection [Duarte et al., 2024] includes classic papers from arXiv.
Dataset Splits | Yes | "We randomly sample approximately 50 ground-truth samples per dataset to train the prototype-based classifier, with the remaining samples serving as the texts to be detected."
Hardware Specification | Yes | "In our work, all experiments are implemented on a workstation with five NVIDIA Tesla V100 32G GPUs, and Ubuntu 22.04.4."
Software Dependencies | No | The paper mentions Ubuntu 22.04.4 as the operating system but does not give versions for key software components or libraries (e.g., Python, PyTorch, CUDA) used in the experiments.
Experiment Setup | Yes | For each text to be detected, three suffixes were generated using the target LLM, with the maximum suffix length set to 512 tokens. The parameter γ is set to 10 for obtaining the perturbed text r. The p-value significance threshold is chosen from {0.001, 0.01, 0.05, 0.1} to select the critical perturbation calibration features. Approximately 50 ground-truth samples per dataset are randomly sampled to train the prototype-based classifier.
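The reported setup can be summarized as a small configuration sketch together with the ~50-sample split used for the prototype-based classifier. This is a minimal illustration: the constant names, dict layout, helper function, and seed below are our own assumptions, not taken from the released VeilProbe code.

```python
import random

# Hyperparameters quoted in the paper's experiment setup; the dict
# structure and key names are illustrative, not from the released code.
EXPERIMENT_CONFIG = {
    "num_suffixes": 3,           # suffixes generated per text by the target LLM
    "max_suffix_tokens": 512,    # maximum suffix length
    "gamma": 10,                 # perturbation parameter for the perturbed text r
    "p_value_grid": [0.001, 0.01, 0.05, 0.1],  # candidate significance thresholds
    "n_train": 50,               # ground-truth samples per dataset for the classifier
}

def split_ground_truth(samples, n_train=EXPERIMENT_CONFIG["n_train"], seed=0):
    """Randomly hold out ~n_train ground-truth samples to train the
    prototype-based classifier; the rest are the texts to be detected.
    (Hypothetical helper; the paper does not specify a seed.)"""
    rng = random.Random(seed)
    pool = list(samples)
    rng.shuffle(pool)
    return pool[:n_train], pool[n_train:]

# Example with 200 dummy sample IDs: 50 go to training, 150 to detection.
train, to_detect = split_ground_truth(range(200))
```

The split is the only detail of the protocol that is fully specified; the suffix-generation and perturbation-calibration steps depend on the target LLM and are not reproducible from the reported numbers alone.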