Unlocking Post-hoc Dataset Inference with Synthetic Data

Authors: Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigations."
Researcher Affiliation | Collaboration | (1) CISPA Helmholtz Center for Information Security, (2) Carnegie Mellon University, (3) Datology AI.
Pseudocode | Yes | "M. Algorithm of Our Work: We present the detailed algorithms for our held-out data generation in Algorithm 1, and post-hoc calibration in Algorithm 2."
Open Source Code | Yes | "Our code is available at https://github.com/sprintml/PostHocDatasetInference."
Open Datasets | Yes | "We demonstrate the effectiveness of our approach on diverse textual datasets, ranging from single-author datasets (e.g., personal blog posts) to large-scale, multi-author collections such as Wikipedia. Our results show that using synthetic held-out data, combined with calibration, enables DI to detect unauthorized training data use with high confidence while keeping false positives low. This expands the practical applicability of DI and provides a pathway for data owners to safeguard their intellectual property in an era of LLMs."
Dataset Splits | Yes | "We collect 1400 blog posts from a single author. All figures, tables, videos, and hyperlinks are removed during pre-processing and only plain text is used for evaluation. We sample 450 posts as member data and fine-tune a Pythia 410M deduplicated model as the target model. The other posts are held out as non-member and held-out sets for the evaluation."
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU/CPU models, memory amounts, or other machine specifications used for running its experiments.
Software Dependencies | No | The paper mentions specific models like the "Pythia 410M deduplicated model" and the "Llama 3 8B model", and techniques like "LoRA", but it does not provide version numbers for the underlying software libraries or dependencies (e.g., PyTorch version, specific LoRA library version).
Experiment Setup | Yes | "The LoRA rank for the generator is 32. The generator is trained for 100 epochs, and the learning rate is set to 2e-4. We set a warm-up ratio of 0.03, and a linear scheduler is used to dynamically adjust the learning rate."
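The dataset-splits row above describes sampling 450 of 1400 blog posts as member (training) data and holding out the remainder as non-member/held-out data. A minimal sketch of such a split, with all function names and the seed chosen here for illustration only (the paper's released code may do this differently):

```python
import random

def split_posts(posts, n_member=450, seed=0):
    """Randomly designate n_member posts as member (training) data;
    the remaining posts serve as non-member / held-out evaluation data.
    This is an illustrative sketch, not the authors' implementation."""
    rng = random.Random(seed)
    shuffled = list(posts)
    rng.shuffle(shuffled)
    return shuffled[:n_member], shuffled[n_member:]

# With 1400 posts, this yields 450 members and 950 non-member/held-out posts.
posts = [f"post_{i}" for i in range(1400)]
members, held_out = split_posts(posts)
```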
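The experiment-setup row quotes the generator's fine-tuning hyperparameters: LoRA rank 32, 100 epochs, learning rate 2e-4, warm-up ratio 0.03, and a linear schedule. A hedged sketch of that schedule, assuming linear warm-up to the peak rate followed by linear decay to zero (a common convention; the paper does not spell out the decay target):

```python
def linear_schedule_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warm-up from 0 to peak_lr over warmup_ratio * total_steps,
    then linear decay back to 0. Illustrative only; the authors' exact
    scheduler implementation is not specified in the paper."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

# Remaining stated hyperparameters, collected for reference.
generator_config = {
    "lora_rank": 32,
    "epochs": 100,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.03,
}
```

In a typical setup these values would feed into a LoRA configuration (e.g., rank 32) and an optimizer whose learning rate is updated per step via a schedule like the one above.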