A Statistical Approach for Controlled Training Data Detection
Authors: Zirui Hu, Yingjie Wang, Zheng Zhang, Hong Chen, Dacheng Tao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments on real-world datasets, such as WikiMIA, XSum, and Real-Time BBC News, further validate KTD's superior performance compared to existing methods. |
| Researcher Affiliation | Academia | 1Generative AI Lab, College of Computing and Data Science, Nanyang Technological University 2College of Informatics, Huazhong Agricultural University |
| Pseudocode | No | The paper describes procedures and theorems (e.g., Proposition 1, Proposition 2, Theorem 1, Theorem 2, Lemma 1) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our experiments is available at https://github.com/huzr1999/KTD |
| Open Datasets | Yes | We conduct our experiments on three datasets: WikiMIA (Shi et al., 2023) includes texts collected from Wikipedia events. ... XSum (Narayan et al., 2018) includes summaries of BBC news articles. ... BBC Real Time (Li et al., 2024b) includes BBC articles from January 2017 to August 2024. |
| Dataset Splits | Yes | WikiMIA ... The dataset is separated into two disjoint parts: one corresponding to events happening before 2017 and the other to events happening after 2023. These two parts are used as training samples and non-training samples, respectively. ... XSum ... We select the test set of this dataset and randomly separate it into two parts, corresponding to training and non-training samples. ... BBC Real Time ... we use the articles published in 2017 as training samples and articles published in 2024 as non-training samples. |
| Hardware Specification | Yes | All the experiments are run with a single NVIDIA Tesla V100 32GB GPU and a 10-core Intel Xeon (Skylake IBRS) CPU. |
| Software Dependencies | No | All codes are implemented with PyTorch (Paszke et al., 2019). ... All other hyperparameters were set to the default values provided by the Training Arguments class in the Transformers library. |
| Experiment Setup | Yes | For fine-tuning, we used the following settings: warmup step = 100; weight decay = 0.01; batch size = 8; num epochs = 3 (10 for GPT-2). ... For paraphrasing, we applied the following configurations: Top-k sampling with top_k = 50; Top-p sampling with top_p = 0.95; Temperature scaling with temperature = 1.9. |
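The reported fine-tuning and paraphrasing settings can be collected into a single configuration sketch. This is a hedged illustration assembled from the table above, not the authors' released code; the names `FINETUNE_ARGS`, `PARAPHRASE_SAMPLING`, and `epochs_for` are hypothetical, with keys mirroring the Hugging Face `TrainingArguments` / `generate()` parameter names the paper implies (defaults elsewhere are taken from the Transformers library).

```python
# Hypothetical configuration sketch of the reported experiment setup.
# Keys follow Hugging Face TrainingArguments / generate() naming, but this
# block is an illustration built from the paper's table, not the authors' code.

FINETUNE_ARGS = {
    "warmup_steps": 100,                # warmup step = 100
    "weight_decay": 0.01,               # weight decay = 0.01
    "per_device_train_batch_size": 8,   # batch size = 8
    "num_train_epochs": 3,              # 3 epochs (10 for GPT-2, per the paper)
}

PARAPHRASE_SAMPLING = {
    "do_sample": True,
    "top_k": 50,        # top-k sampling with top_k = 50
    "top_p": 0.95,      # nucleus (top-p) sampling with top_p = 0.95
    "temperature": 1.9, # temperature scaling with temperature = 1.9
}

def epochs_for(model_name: str) -> int:
    """Return the epoch count the paper reports for a given model family."""
    if model_name.lower().startswith("gpt-2"):
        return 10
    return FINETUNE_ARGS["num_train_epochs"]
```

All other hyperparameters would fall back to the `TrainingArguments` defaults, as the paper states.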