Language Models May Verbatim Complete Text They Were Not Explicitly Trained On
Authors: Ken Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our work, we first find that even after removing a set of extracted sequences from the training dataset and retraining the LLM from scratch, the retrained model can still verbatim complete 40% of them under our experimental conditions (Section 4). Upon investigation, we find that these removed yet still completed sequences are either de facto members of the training set (but under a different definition of membership) or lack sufficient complexity: many examples have near-duplicates, share m-grams (m < n) that are not removed, or are explained by the model's generalization capabilities (e.g., patterns or counting). |
| Researcher Affiliation | Collaboration | 1Google 2Work completed while on internship at Google DeepMind. Now at Stanford University. 3Stanford University. Correspondence to: Ken Liu <EMAIL>, Christopher A. Choquette-Choo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Fine-tuning sequences from Chunking (§5.1). Input: a sequence x of n tokens, chunk size c, overlap l, random seed s. Output: a sequence x′ of length n with exactly one chunk from x at a random position and the rest filled with random tokens. 1: set random seed to s; 2: positions ← [0, (c−l), 2(c−l), ..., (n−l)] (possible start positions for the chunk); 3: p ← randomly chosen from positions; 4: x′ ← sequence of n tokens, initialized with placeholders; 5: x′[p : p+c] ← x[p : p+c] (copy a chunk from x, truncating if needed); 6: replace each remaining placeholder in x′ with a random token from the tokenizer's vocabulary; 7: return x′. |
| Open Source Code | No | The paper discusses the methodology but does not explicitly state that the source code for their specific methods (e.g., n-gram filtering, adversarial dataset construction techniques) is publicly available or provided. It mentions using "LLM.c (Karpathy, 2024) for an efficient pre-training pipeline" which is a third-party tool, but not their own code for the paper's novel contributions. |
| Open Datasets | Yes | Data. For all models, we use FineWeb-Edu (Penedo et al., 2024) as a state-of-the-art pre-training dataset. |
| Dataset Splits | Yes | 2. Identify verbatim completions: We then collect a set of sequences Dmem of length k that Mbase can complete verbatim (as in Def. 3.2), by checking the first k tokens of every training document in Dbase. This is a simple and effective procedure since LLMs are known to memorize training data (e.g., Carlini et al. (2022b)); other choices to obtain Dmem are also possible. 3. n-gram filtering: We then filter each sequence x ∈ Dmem away from Dbase. ... The filtered dataset is denoted as D(n)_filter. |
| Hardware Specification | Yes | Compute: 8 NVIDIA H100-days (1.6B-parameter model) |
| Software Dependencies | Yes | We use LLM.c (Karpathy, 2024) for an efficient pre-training pipeline. |
| Experiment Setup | Yes | Table 9 (training configurations for pre-training experiments): # Training Tokens: 33.6 billion; Micro-Batch Size: 16; Max Sequence Length: 1024; Total Batch Size: 2^20 = 1,048,576 tokens; Gradient Accumulation Steps: 8; Weight Decay: 0.1; Learning Rate: 6e-4; LR Schedule: Cosine, decaying to 10% of max LR; Warmup: 700 iterations; Total Training Steps: 32,000 |
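The chunking procedure quoted in the Pseudocode row (Algorithm 1) can be sketched in Python. This is an illustrative reconstruction, not the authors' code: the function name `chunked_sequence` and the `vocab_size` parameter (standing in for the tokenizer's vocabulary) are assumptions.

```python
import random

def chunked_sequence(x, chunk_size, overlap, seed, vocab_size):
    """Return a sequence of len(x) tokens containing exactly one chunk
    copied from x at a randomly chosen aligned position; every other
    token is drawn uniformly at random from the vocabulary.

    Hypothetical sketch of Algorithm 1 (chunking, Section 5.1)."""
    rng = random.Random(seed)               # set random seed to s
    n, c, l = len(x), chunk_size, overlap
    # possible start positions: [0, (c-l), 2(c-l), ..., (n-l)]
    positions = list(range(0, n - l + 1, c - l))
    p = rng.choice(positions)               # pick one chunk position
    # fill x' with random vocabulary tokens (the "placeholders")
    out = [rng.randrange(vocab_size) for _ in range(n)]
    # copy one chunk from x; slicing truncates at n automatically
    out[p : p + c] = x[p : p + c]
    return out
```

Note that Python slice assignment handles the "truncate if needed" case in step 5 for free: when `p + c` exceeds `n`, both slices shrink to the same length.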
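The n-gram filtering step quoted in the Dataset Splits row can be illustrated with a minimal sketch. The rule shown here (drop any training document sharing at least one token-level n-gram with a to-be-removed sequence) is an assumption for illustration; the paper's actual filtering rule may differ in detail.

```python
def ngrams(tokens, n):
    """All contiguous token n-grams of a sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_filter(documents, targets, n):
    """Hedged sketch of n-gram filtering: remove every training
    document that shares an n-gram with any target sequence in Dmem,
    yielding the filtered dataset D(n)_filter."""
    banned = set()
    for t in targets:
        banned |= ngrams(t, n)
    return [d for d in documents if not (ngrams(d, n) & banned)]
```

The paper's observation that removed sequences may still be completed via surviving m-grams (m < n) is visible here: a filter at n leaves all shorter shared subsequences in place.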