Min-K%++: Improved Baseline for Pre-Training Data Detection from Large Language Models
Authors: Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Yang, Hai Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, the proposed method achieves new SOTA performance across multiple settings (evaluated with 5 families of 10 models and 2 benchmarks). On the WikiMIA benchmark, Min-K%++ outperforms the runner-up by 6.2% to 10.5% in detection AUROC averaged over five models. On the more challenging MIMIR benchmark, it consistently improves upon reference-free methods while performing on par with the reference-based method, which requires an extra reference model. |
| Researcher Affiliation | Academia | Jingyang Zhang¹, Jingwei Sun¹, Eric Yeats¹, Yang Ouyang¹, Martin Kuo¹, Jianyi Zhang¹, Hao Frank Yang¹·², Hai Li¹. ¹Duke University, ²Johns Hopkins University |
| Pseudocode | Yes | We show the Python- and PyTorch-style pseudo-code above, which implements Min-K%++. |
| Open Source Code | No | The paper provides a URL 'https://zjysteven.github.io/mink-plus-plus/' which appears to be a project page, but it does not explicitly state that the source code for the methodology is released there, nor is it a direct link to a code repository. The pseudocode in Appendix A does not constitute an open-source code release. |
| Open Datasets | Yes | We focus on two benchmarks (and the only two to our knowledge) for pre-training data detection: WikiMIA (Shi et al., 2024) and MIMIR (Duan et al., 2024). MIMIR (Duan et al., 2024) is built upon the Pile dataset (Gao et al., 2020). |
| Dataset Splits | Yes | WikiMIA specifically groups data into splits according to sentence length, intending to provide a fine-grained evaluation. MIMIR (Duan et al., 2024) is built upon the Pile dataset (Gao et al., 2020), where training samples and non-training samples are drawn from the train and test split, respectively. Concretely, each input text is created by concatenating a training text at the end of a non-training text, closely simulating the representative scenario discussed above. Both the training and non-training text have random length, varying among {32, 64, 128}. In this online setting, the prediction on each part of the input, rather than on the whole input, is of interest. Therefore, we split each input into chunks with a length of 32. |
| Hardware Specification | No | The paper mentions various models (e.g., LLaMA, Pythia, Mamba, GPT-NeoX, OPT) with their parameter counts, but it does not specify the hardware (e.g., specific GPU or CPU models) used to run the experiments with these models. |
| Software Dependencies | No | Appendix A provides "python and pytorch-style pseudo-code" and implicitly uses `torch` and `numpy`, but no specific version numbers for Python, PyTorch, or NumPy are mentioned. |
| Experiment Setup | Yes | For all methods, we either take the recommended configuration directly from the used benchmarks (Duan et al., 2024) or choose the hyperparameters with a hold-out validation set, following Shi et al. (2024). k determines what percent of token sequences with minimum scores are chosen to compute the final score. From Figure 4, it is obvious that Min-K%++ is robust to the choice of k, with the best and the worst result being 84.8% and 82.1% (a variation of 2.7%), respectively. |
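The pseudocode noted in the table is not reproduced in this review, but the method it implements is straightforward to sketch: Min-K%++ normalizes each observed token's log-probability by the mean and standard deviation of the model's full next-token log distribution at that position, then averages the scores of the bottom k% of tokens. A minimal NumPy sketch under those assumptions follows; the function name and array layout are illustrative, not the authors' reference implementation:

```python
import numpy as np

def mink_plus_plus_score(token_log_probs, vocab_log_probs, k=0.2):
    """Sketch of a Min-K%++-style score.

    token_log_probs: shape (T,), log p(x_t | x_<t) for each observed token.
    vocab_log_probs: shape (T, V), the model's full next-token log
        distribution at each of the T positions.
    k: fraction of lowest-scoring tokens to average (the "min-k%").
    """
    probs = np.exp(vocab_log_probs)                       # p(z | x_<t) over the vocab
    mu = (probs * vocab_log_probs).sum(axis=-1)           # E_z[log p(z | x_<t)]
    var = (probs * vocab_log_probs**2).sum(axis=-1) - mu**2
    sigma = np.sqrt(np.maximum(var, 1e-12))               # guard against degenerate variance
    token_scores = (token_log_probs - mu) / sigma         # normalized per-token score
    n = max(1, int(len(token_scores) * k))                # how many tokens fall in the min-k%
    return np.sort(token_scores)[:n].mean()               # average the lowest-scoring tokens
```

In practice the per-position log distributions would come from a single forward pass (e.g. a log-softmax over the logits), and a higher score is taken as evidence that the text was seen during pre-training; the robustness to k reported in the table (2.7% AUROC variation) suggests the default choice of k is not critical.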