Unlocking Post-hoc Dataset Inference with Synthetic Data

Authors: Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigations."
Researcher Affiliation | Collaboration | (1) CISPA Helmholtz Center for Information Security, (2) Carnegie Mellon University, (3) Datology AI.
Pseudocode | Yes | "M. Algorithm of Our Work: We present the detailed algorithms for our held-out data generation in Algorithm 1, and post-hoc calibration in Algorithm 2."
Open Source Code | Yes | "Our code is available at https://github.com/sprintml/PostHocDatasetInference."
Open Datasets | Yes | "We demonstrate the effectiveness of our approach on diverse textual datasets, ranging from single-author datasets (e.g., personal blog posts) to large-scale, multi-author collections such as Wikipedia. Our results show that using synthetic held-out data, combined with calibration, enables DI to detect unauthorized training data use with high confidence while keeping false positives low. This expands the practical applicability of DI and provides a pathway for data owners to safeguard their intellectual property in an era of LLMs."
Dataset Splits | Yes | "We collect 1400 blog posts from a single author. All figures, tables, videos, and hyperlinks are removed during pre-processing and only plain text is used for evaluation. We sample 450 posts as member data and fine-tune a Pythia 410M deduplicated model as the target model. The other posts are held out as non-member and held-out sets for the evaluation."
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU/CPU models, memory amounts, or other machine specifications used for running its experiments.
Software Dependencies | No | The paper mentions specific models like the "Pythia 410M deduplicated model" and the "Llama 3 8B model", and techniques like "LoRA", but it does not provide version numbers for the underlying software libraries or dependencies (e.g., PyTorch version, specific LoRA library version).
Experiment Setup | Yes | "The LoRA rank for the generator is 32. The generator is trained for 100 epochs, and the learning rate is set to 2e-4. We set a warm-up ratio of 0.03, and a linear scheduler is used to dynamically adjust the learning rate."
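The dataset-splits row above describes sampling 450 of 1400 blog posts as member (training) data and holding out the remainder as non-member/held-out data. A minimal sketch of such a split, with all function names and the seed chosen here for illustration only (the paper's released code may do this differently):

```python
import random

def split_posts(posts, n_member=450, seed=0):
    """Randomly designate n_member posts as member (training) data;
    the remaining posts serve as non-member / held-out evaluation data.
    This is an illustrative sketch, not the authors' implementation."""
    rng = random.Random(seed)
    shuffled = list(posts)
    rng.shuffle(shuffled)
    return shuffled[:n_member], shuffled[n_member:]

# With 1400 posts, this yields 450 members and 950 non-member/held-out posts.
posts = [f"post_{i}" for i in range(1400)]
members, held_out = split_posts(posts)
```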
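The experiment-setup row quotes the generator's fine-tuning hyperparameters: LoRA rank 32, 100 epochs, learning rate 2e-4, warm-up ratio 0.03, and a linear schedule. A hedged sketch of that schedule, assuming linear warm-up to the peak rate followed by linear decay to zero (a common convention; the paper does not spell out the decay target):

```python
def linear_schedule_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warm-up from 0 to peak_lr over warmup_ratio * total_steps,
    then linear decay back to 0. Illustrative only; the authors' exact
    scheduler implementation is not specified in the paper."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

# Remaining stated hyperparameters, collected for reference.
generator_config = {
    "lora_rank": 32,
    "epochs": 100,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.03,
}
```

In a typical setup these values would feed into a LoRA configuration (e.g., rank 32) and an optimizer whose learning rate is updated per step via a schedule like the one above.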