TDDBench: A Benchmark for Training Data Detection

Authors: Zhihao Zhu, Yi Yang, Defu Lian

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "In this work, we introduce TDDBench, which consists of 13 datasets spanning three data modalities: image, tabular, and text. We benchmark 21 different TDD methods across four detection paradigms and evaluate their performance from five perspectives: average detection performance, best detection performance, memory consumption, and computational efficiency in both time and memory. Our extensive experiments also reveal the generally unsatisfactory performance of TDD algorithms across different datasets."
Researcher Affiliation | Academia | Zhihao Zhu (1,2), Yi Yang (2, corresponding), Defu Lian (1, corresponding); 1 University of Science and Technology of China, 2 The Hong Kong University of Science and Technology; EMAIL, EMAIL, EMAIL
Pseudocode | Yes | "By following these steps in Alg 1 and Alg 2, you can effectively implement both Model-based and Query-based TDD algorithms." Algorithm 1: how to train reference models in Model-based TDD algorithms. Algorithm 2: how to obtain extra queries in Query-based TDD algorithms.
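The reference-model step (Algorithm 1) can be illustrated with a minimal numpy sketch. This is not the authors' code: `train_reference_models` and the `toy_train` nearest-centroid trainer are hypothetical stand-ins for a real model trainer; the only detail taken from the paper is the idea of training multiple reference models on subsets of auxiliary data while tracking which points each model saw.

```python
import numpy as np

def train_reference_models(aux_x, aux_y, n_models=16, train_fn=None, seed=0):
    """Sketch of a model-based TDD setup: train reference models on random
    halves of the auxiliary data, recording which points each model saw (IN)
    versus held out (OUT)."""
    rng = np.random.default_rng(seed)
    n = len(aux_x)
    models, in_masks = [], []
    for _ in range(n_models):
        mask = rng.random(n) < 0.5          # each point is IN with prob. 1/2
        models.append(train_fn(aux_x[mask], aux_y[mask]))
        in_masks.append(mask)
    return models, np.stack(in_masks)

def toy_train(x, y):
    # Hypothetical stand-in trainer: a nearest-centroid classifier.
    centroids = {c: x[y == c].mean(axis=0) for c in np.unique(y)}
    return lambda q: min(centroids, key=lambda c: np.linalg.norm(q - centroids[c]))

rng = np.random.default_rng(1)
x = rng.normal(size=(40, 2))
y = (x[:, 0] > 0).astype(int)
models, in_masks = train_reference_models(x, y, n_models=4, train_fn=toy_train)
```

The IN/OUT masks are the key bookkeeping: they let a detector compare a point's behavior under models that did train on it against models that did not.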
Open Source Code | Yes | "To enhance accessibility and reproducibility, we open-source TDDBench for the research community at https://github.com/zzh9568/TDDBench."
Open Datasets | Yes | TDDBench consists of 13 datasets across three data modalities: image, tabular, and text. It incorporates datasets commonly used to evaluate TDD algorithms in prior literature (Truex et al., 2019; Hui et al., 2021), such as CIFAR-10 and Purchase. "We also compile new datasets that potentially contain private or copyright-sensitive information, including CelebA (human faces), BloodMNIST (medical), Adult (personal income), and Tweet (social networks), which are more likely to necessitate TDD for tasks like copyright verification and unlearning confirmation." Additionally, WIKIMIA is a dataset specifically designed to evaluate TDD algorithms on large language models.
Dataset Splits | Yes | "Specifically, given a dataset in TDDBench, we divide the dataset into a target dataset and an auxiliary dataset in a 50:50 ratio. For the target dataset, we further split it into two halves, where the first half serves as the training dataset to train the target model (e.g., an image classifier), and the remaining half is not used in training the target model. Therefore, the training dataset serves as the positive examples for training data detection, while the remaining data serves as the negative examples."
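The split described above reduces to simple index bookkeeping. The sketch below assumes nothing beyond the quoted ratios; `tddbench_split` is a hypothetical helper name, not from the released code:

```python
import numpy as np

def tddbench_split(n, seed=0):
    """Split n example indices: dataset -> target / auxiliary (50:50),
    then target -> members (used to train the target model) / non-members."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    target, auxiliary = np.split(idx, [n // 2])
    members, non_members = np.split(target, [len(target) // 2])
    return members, non_members, auxiliary

members, non_members, auxiliary = tddbench_split(1000)
```

Members are the positive detection examples, non-members the negatives, and the auxiliary half is available to detectors for training reference models or auxiliary classifiers.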
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. It only mentions "Due to limitations in computing resources" in the context of TDD on large models.
Software Dependencies | No | The paper mentions using the Adam and AdamW optimizers (Section 3.1 and Table 15, respectively) but does not provide version numbers for these or any other core software dependencies (e.g., Python, PyTorch, TensorFlow, scikit-learn).
Experiment Setup | Yes | "For the learning-based TDD methods, we construct a two-layer neural network with 64 and 32 hidden units as the auxiliary classifier. The learning rate is set to 0.001, using the Adam optimizer, and training continues until the validation accuracy does not improve for 30 epochs or until a maximum of 500 epochs is reached. For the model-based TDD methods, we train 16 reference models. Finally, for the query-based TDD methods, including Query-neighbor, Query-augment, and Query-ref, we limit the detection algorithms to a maximum of 10 additional queries per data point." Table 15 lists training details for the various model architectures, including learning rate, weight decay, and maximum training epochs (MLP = Multilayer Perceptron, LR = Logistic Regression; N/A indicates the hyperparameter does not apply to that model).
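The stopping rule quoted above (stop when validation accuracy has not improved for 30 epochs, or after at most 500 epochs) can be sketched as a small helper. `EarlyStopper` is an illustrative name, not from the paper's code, and the plateauing accuracy stream below is simulated:

```python
class EarlyStopper:
    """Stop when validation accuracy has not improved for `patience`
    consecutive epochs, or when `max_epochs` is reached."""
    def __init__(self, patience=30, max_epochs=500):
        self.patience, self.max_epochs = patience, max_epochs
        self.best, self.stale, self.epoch = float("-inf"), 0, 0

    def step(self, val_acc):
        self.epoch += 1
        if val_acc > self.best:
            self.best, self.stale = val_acc, 0   # new best: reset the counter
        else:
            self.stale += 1
        return self.stale >= self.patience or self.epoch >= self.max_epochs

# Simulated run: accuracy improves for 5 epochs, then plateaus,
# so training should halt after 30 further epochs without improvement.
stopper = EarlyStopper()
stopped_at = None
for epoch in range(1, 501):
    val_acc = min(epoch, 5) * 0.1
    if stopper.step(val_acc):
        stopped_at = epoch
        break
```

In the paper's setup the same rule would wrap the Adam training loop of the 64/32-unit auxiliary classifier; only the patience (30) and epoch cap (500) are taken from the quoted text.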