Unlocking Post-hoc Dataset Inference with Synthetic Data
Authors: Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigations. |
| Researcher Affiliation | Collaboration | CISPA Helmholtz Center for Information Security, Carnegie Mellon University, Datology AI. |
| Pseudocode | Yes | M. Algorithm of Our Work We present the detailed algorithms for our held-out data generation in Algorithm 1, and post-hoc calibration in Algorithm 2. |
| Open Source Code | Yes | Our code is available at https://github.com/sprintml/Post-Hoc-Dataset-Inference. |
| Open Datasets | Yes | We demonstrate the effectiveness of our approach on diverse textual datasets, ranging from single-author datasets (e.g., personal blog posts) to large-scale, multi-author collections such as Wikipedia. Our results show that using synthetic held-out data, combined with calibration, enables DI to detect unauthorized training data use with high confidence while keeping false positives low. This expands the practical applicability of DI and provides a pathway for data owners to safeguard their intellectual property in an era of LLMs. |
| Dataset Splits | Yes | We collect 1400 blog posts from a single author. All figures, tables, videos, and hyperlinks are removed during pre-processing and only plain text is used for evaluation. We sample 450 posts as member data and fine-tune a Pythia 410M deduplicated model as the target model. The remaining posts are held out as non-member and held-out sets for the evaluation. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details like GPU/CPU models, memory amounts, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using specific models like "Pythia 410M deduplicated model" and "Llama 3 8B model", and techniques like "LoRA", but it does not provide specific version numbers for the underlying software libraries or dependencies (e.g., PyTorch version, specific LoRA library version). |
| Experiment Setup | Yes | The LoRA rank for the generator is 32. The generator is trained for 100 epochs, and the learning rate is set to 2e-4. We set a warm-up ratio of 0.03, and a linear scheduler is used to dynamically adjust the learning rate. |
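The dataset split described above (450 of 1400 blog posts sampled as member data, the rest held out) can be sketched as follows. This is a minimal illustration, not the authors' code: the placeholder post list, the random seed, and the even split of the remaining posts into non-member and held-out halves are all assumptions, since the paper does not specify the exact sizes of those two sets.

```python
import random

# Placeholder corpus: the paper uses 1400 pre-processed blog posts from one author.
posts = [f"post_{i}" for i in range(1400)]

rng = random.Random(0)  # hypothetical seed for reproducibility

# Sample 450 posts as member data used to fine-tune the target model.
member = rng.sample(posts, 450)
member_set = set(member)

# The remaining 950 posts serve as non-member and held-out sets;
# an even split is assumed here for illustration.
rest = [p for p in posts if p not in member_set]
half = len(rest) // 2
non_member, held_out = rest[:half], rest[half:]
```

Disjointness between the member set and the evaluation sets is the key property: any overlap would inflate the membership-inference signal.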
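The generator training setup in the last row (LoRA rank 32, 100 epochs, learning rate 2e-4, warm-up ratio 0.03, linear scheduler) can be captured as a small configuration sketch. The dictionary keys and the `linear_lr` helper below are illustrative, not the paper's implementation; the helper mimics a standard linear warm-up followed by linear decay to zero.

```python
# Hyperparameters quoted from the paper; key names are our own.
config = {
    "lora_rank": 32,
    "epochs": 100,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.03,
    "lr_scheduler": "linear",
}


def linear_lr(step: int, total_steps: int,
              base_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Linear warm-up for the first `warmup_ratio` of steps,
    then linear decay from base_lr down to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

In practice this schedule corresponds to a "linear" scheduler with `warmup_ratio=0.03` in common training frameworks; the sketch just makes the shape of the learning-rate curve explicit.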