Fine-tuning can Help Detect Pretraining Data from Large Language Models
Authors: Hengxiang Zhang, Songxin Zhang, Bingyi Jing, Hongxin Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method, significantly improving the AUC score on common benchmark datasets across various models. To validate the effectiveness of our method, we conduct extensive experiments on various datasets, including WikiMIA, BookMIA (Shi et al., 2024), ArXivTection, BookTection (Duarte et al., 2024), and Pile (Maini et al., 2024). The results demonstrate that our method can significantly improve the performance of existing methods based on scoring functions. |
| Researcher Affiliation | Academia | Hengxiang Zhang, Songxin Zhang, Bingyi Jing, Hongxin Wei (Department of Statistics and Data Science, Southern University of Science and Technology) |
| Pseudocode | No | The paper describes the Fine-tuned Score Deviation (FSD) method through prose and mathematical formulations (Equations 4 and 5) but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/ml-stat-Sustech/Fine-tuned-Score-Deviation. |
| Open Datasets | Yes | To verify the effectiveness of detection methods, we employ common benchmark datasets for evaluations, including WikiMIA (Shi et al., 2024), ArXivTection (Duarte et al., 2024), BookTection (Duarte et al., 2024), BookMIA (Shi et al., 2024), and Pile (Maini et al., 2024). Previous works have demonstrated that model developers commonly use text content among those datasets for pre-training (Shi et al., 2024; Duarte et al., 2024; Ye et al., 2024). The datasets are provided by Hugging Face, and detailed information of datasets is presented in Appendix B. |
| Dataset Splits | Yes | For constructing the non-member dataset, we randomly sample 30% of the data from the entire dataset and select all non-members from this subset as the constructed fine-tuning dataset. The remaining 70% of the dataset is used for testing. For the copyrighted book detection experiments on BookMIA and BookTection, we randomly sample 30% of the dataset and select all non-members from this subset as the fine-tuning dataset. Subsequently, we randomly sample 500 members and 500 non-members from the remaining 70% of the datasets, constructing a balanced validation set of 1,000 examples for evaluation. The detailed information of the constructed dataset is shown in Table 8 and Table 9. |
| Hardware Specification | Yes | We conduct all experiments on NVIDIA L40 GPU and implement all methods with default parameters using PyTorch (Paszke et al., 2019). |
| Software Dependencies | No | We conduct all experiments on NVIDIA L40 GPU and implement all methods with default parameters using PyTorch (Paszke et al., 2019). The paper mentions PyTorch and cites its paper but does not specify a version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | We employ LoRA (Hu et al., 2022) to fine-tune the base model with 3 epochs and a batch size of 8. We set the initial learning rate to 0.001 and decay it with a cosine scheduling strategy. |
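The Fine-tuned Score Deviation idea summarized in the Pseudocode row can be sketched as below. This is an illustration, not a transcription of the paper's Equations 4 and 5: the averaged token log-likelihood scoring function and the sign convention are assumptions, and the paper also supports other scoring functions (e.g. perplexity, Min-K%).

```python
def sequence_score(token_log_probs):
    """One possible scoring function: average token log-likelihood.
    (An assumed choice for illustration; the paper's FSD works on top of
    any existing membership-inference scoring function.)"""
    return sum(token_log_probs) / len(token_log_probs)

def fsd_score(log_probs_base, log_probs_tuned):
    """Sketch of Fine-tuned Score Deviation: the change in an example's
    score between the base model and a copy fine-tuned on known
    non-members. Unseen non-members tend to shift more under this
    fine-tuning than pretraining members, so the magnitude of the
    deviation separates the two groups."""
    return sequence_score(log_probs_tuned) - sequence_score(log_probs_base)
```

For instance, an example whose per-token log-probabilities rise from `[-2.0, -2.0]` under the base model to `[-1.0, -1.0]` after fine-tuning gets an FSD of `1.0`, a larger deviation than a member whose scores barely move.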
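The split procedure quoted in the Dataset Splits row can be sketched as follows. The dict fields (`text`, `is_member`) and the helper name are assumptions for illustration; the 30%/70% split and the balanced 500+500 validation set come from the quoted text.

```python
import random

def build_fsd_splits(dataset, seed=42):
    """Sketch of the reported split: sample 30% of the data, keep only its
    non-members as the fine-tuning set, then draw 500 members and 500
    non-members from the remaining 70% as a balanced validation set.
    Each example is assumed to be {'text': str, 'is_member': bool}."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)

    # 30% of the data is sampled; only its non-members are used for fine-tuning.
    cut = int(0.3 * len(data))
    sampled, remaining = data[:cut], data[cut:]
    finetune_set = [x for x in sampled if not x["is_member"]]

    # From the remaining 70%, draw 500 members and 500 non-members
    # to form the balanced 1,000-example validation set.
    members = [x for x in remaining if x["is_member"]]
    non_members = [x for x in remaining if not x["is_member"]]
    val_set = (rng.sample(members, min(500, len(members)))
               + rng.sample(non_members, min(500, len(non_members))))
    rng.shuffle(val_set)
    return finetune_set, val_set
```

Keeping the validation set balanced makes the reported AUC directly comparable across datasets with different member/non-member ratios.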
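The fine-tuning setup quoted in the Experiment Setup row maps onto a standard Hugging Face `peft`/`transformers` configuration. This is a sketch under assumptions: the base model name, LoRA rank/alpha, and output directory are placeholders not stated in the quote; only the 3 epochs, batch size 8, initial learning rate 0.001, and cosine decay are reported.

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

# Placeholder base model; the paper evaluates several LLMs.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Assumed LoRA hyperparameters (rank and alpha are not given in the quote).
lora_cfg = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="fsd-finetune",       # placeholder path
    num_train_epochs=3,              # as reported
    per_device_train_batch_size=8,   # as reported
    learning_rate=1e-3,              # as reported
    lr_scheduler_type="cosine",      # cosine decay, as reported
)

# Pass the tokenized non-member fine-tuning set (hypothetical variable name):
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_non_members)
# trainer.train()
```

Note that only the LoRA adapter weights are updated, so the fine-tuned scores can be compared against the frozen base model at low cost.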