Min-K%++: Improved Baseline for Pre-Training Data Detection from Large Language Models
Authors: Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Yang, Hai Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, the proposed method achieves new SOTA performance across multiple settings (evaluated with 5 families of 10 models and 2 benchmarks). On the WikiMIA benchmark, Min-K%++ outperforms the runner-up by 6.2% to 10.5% in detection AUROC averaged over five models. On the more challenging MIMIR benchmark, it consistently improves upon reference-free methods while performing on par with the reference-based method, which requires an extra reference model. |
| Researcher Affiliation | Academia | Jingyang Zhang¹, Jingwei Sun¹, Eric Yeats¹, Yang Ouyang¹, Martin Kuo¹, Jianyi Zhang¹, Hao Frank Yang¹·², Hai Li¹. ¹Duke University, ²Johns Hopkins University |
| Pseudocode | Yes | We show the Python- and PyTorch-style pseudo-code above, which implements Min-K%++. |
| Open Source Code | No | The paper provides a URL 'https://zjysteven.github.io/mink-plus-plus/' which appears to be a project page, but it does not explicitly state that the source code for the methodology is released there, nor is it a direct link to a code repository. The pseudocode in Appendix A does not constitute an open-source code release. |
| Open Datasets | Yes | We focus on two benchmarks (and the only two to our knowledge) for pre-training data detection: WikiMIA (Shi et al., 2024) and MIMIR (Duan et al., 2024). MIMIR (Duan et al., 2024) is built upon the Pile dataset (Gao et al., 2020). |
| Dataset Splits | Yes | WikiMIA specifically groups data into splits according to sentence length, intending to provide a fine-grained evaluation. MIMIR (Duan et al., 2024) is built upon the Pile dataset (Gao et al., 2020), where training samples and non-training samples are drawn from the train and test split, respectively. Concretely, each input text is created by concatenating a training text at the end of a non-training text, closely simulating the representative scenario discussed above. Both the training and non-training text have random length, varying among {32, 64, 128}. In this online setting, the prediction on each part of the input, rather than on the whole input, is of interest. Therefore, we split each input into chunks with a length of 32. |
| Hardware Specification | No | The paper mentions various models (e.g., LLaMA, Pythia, Mamba, GPT-NeoX, OPT) with their parameter counts, but it does not specify the hardware (e.g., specific GPU or CPU models) used to run the experiments with these models. |
| Software Dependencies | No | Appendix A provides "python and pytorch-style pseudo-code" and implicitly uses `torch` and `numpy`, but no specific version numbers for Python, PyTorch, or NumPy are mentioned. |
| Experiment Setup | Yes | For all methods, we either take the recommended configuration directly from the used benchmarks (Duan et al., 2024) or choose the hyperparameters with a hold-out validation set, following Shi et al. (2024). k determines what percent of token sequences with minimum scores are chosen to compute the final score. From Figure 4, it is obvious that Min-K%++ is robust to the choice of k, with the best and the worst result being 84.8% and 82.1% (a variation of 2.7%), respectively. |
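The pseudocode noted in the table is not reproduced in this review, but the method it implements is straightforward to sketch: Min-K%++ normalizes each observed token's log-probability by the mean and standard deviation of the model's full next-token log distribution at that position, then averages the scores of the bottom k% of tokens. A minimal NumPy sketch under those assumptions follows; the function name and array layout are illustrative, not the authors' reference implementation:

```python
import numpy as np

def mink_plus_plus_score(token_log_probs, vocab_log_probs, k=0.2):
    """Sketch of a Min-K%++-style score.

    token_log_probs: shape (T,), log p(x_t | x_<t) for each observed token.
    vocab_log_probs: shape (T, V), the model's full next-token log
        distribution at each of the T positions.
    k: fraction of lowest-scoring tokens to average (the "min-k%").
    """
    probs = np.exp(vocab_log_probs)                       # p(z | x_<t) over the vocab
    mu = (probs * vocab_log_probs).sum(axis=-1)           # E_z[log p(z | x_<t)]
    var = (probs * vocab_log_probs**2).sum(axis=-1) - mu**2
    sigma = np.sqrt(np.maximum(var, 1e-12))               # guard against degenerate variance
    token_scores = (token_log_probs - mu) / sigma         # normalized per-token score
    n = max(1, int(len(token_scores) * k))                # how many tokens fall in the min-k%
    return np.sort(token_scores)[:n].mean()               # average the lowest-scoring tokens
```

In practice the per-position log distributions would come from a single forward pass (e.g. a log-softmax over the logits), and a higher score is taken as evidence that the text was seen during pre-training; the robustness to k reported in the table (2.7% AUROC variation) suggests the default choice of k is not critical.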