Language Models May Verbatim Complete Text They Were Not Explicitly Trained On
Authors: Ken Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our work, we first find that even after removing a set of extracted sequences from the training dataset and retraining the LLM from scratch, the retrained model can still verbatim complete 40% of them under our experimental conditions (Section 4). Upon investigation, we find that these removed yet still completed sequences are either de facto members of the training set (but under a different definition of membership) or lack sufficient complexity: many examples have near-duplicates, share m-grams (m < n) that are not removed, or are explained by the model's generalization capabilities (e.g., patterns or counting). |
| Researcher Affiliation | Collaboration | 1Google 2Work completed while on internship at Google DeepMind. Now at Stanford University. 3Stanford University. Correspondence to: Ken Liu <EMAIL>, Christopher A. Choquette-Choo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Fine-tuning sequences from Chunking (§5.1). Input: a sequence x of n tokens, chunk size c, overlap l, random seed s. Output: a sequence x′ of length n with exactly one chunk from x at a random position and the rest filled with random tokens. 1: set random seed to s; 2: positions ← [0, (c−l), 2(c−l), ..., (n−l)] (possible start positions for the chunk); 3: p ← randomly chosen from positions; 4: x′ ← sequence of n tokens, initialized with placeholders; 5: x′[p : p+c] ← x[p : p+c] (copy a chunk from x, truncating if needed); 6: replace each remaining placeholder in x′ with a random token from the tokenizer's vocabulary; 7: return x′. |
| Open Source Code | No | The paper discusses the methodology but does not explicitly state that the source code for their specific methods (e.g., n-gram filtering, adversarial dataset construction techniques) is publicly available or provided. It mentions using "LLM.c (Karpathy, 2024) for an efficient pre-training pipeline" which is a third-party tool, but not their own code for the paper's novel contributions. |
| Open Datasets | Yes | Data. For all models, we use FineWeb-Edu (Penedo et al., 2024) as a state-of-the-art pre-training dataset. |
| Dataset Splits | Yes | 2. Identify verbatim completions: We then collect a set of sequences Dmem of length k that Mbase can complete verbatim (as in Def. 3.2), by checking the first k tokens of every training document in Dbase. This is a simple and effective procedure since LLMs are known to memorize training data (e.g., Carlini et al. (2022b)); other choices to obtain Dmem are also possible. 3. n-gram filtering: We then filter each sequence x ∈ Dmem away from Dbase. ... The filtered dataset is denoted as D(n)_filter. |
| Hardware Specification | Yes | Compute: 8 NVIDIA H100-days (1.6B-parameter model) |
| Software Dependencies | Yes | We use LLM.c (Karpathy, 2024) for an efficient pre-training pipeline. |
| Experiment Setup | Yes | Table 9 (training configurations for pre-training experiments): # Training Tokens: 33.6 billion; Micro-Batch Size: 16; Max Sequence Length: 1024; Total Batch Size: 2^20 = 1,048,576 tokens; Gradient Accumulation Steps: 8; Weight Decay: 0.1; Learning Rate: 6e-4; LR Schedule: Cosine, decaying to 10% of max LR; Warmup: 700 iterations; Total Training Steps: 32,000 |
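The chunking procedure quoted in the Pseudocode row (Algorithm 1) can be sketched in Python. This is an illustrative reconstruction, not the authors' code: the function name `chunked_sequence` and the `vocab_size` parameter (standing in for the tokenizer's vocabulary) are assumptions.

```python
import random

def chunked_sequence(x, chunk_size, overlap, seed, vocab_size):
    """Return a sequence of len(x) tokens containing exactly one chunk
    copied from x at a randomly chosen aligned position; every other
    token is drawn uniformly at random from the vocabulary.

    Hypothetical sketch of Algorithm 1 (chunking, Section 5.1)."""
    rng = random.Random(seed)               # set random seed to s
    n, c, l = len(x), chunk_size, overlap
    # possible start positions: [0, (c-l), 2(c-l), ..., (n-l)]
    positions = list(range(0, n - l + 1, c - l))
    p = rng.choice(positions)               # pick one chunk position
    # fill x' with random vocabulary tokens (the "placeholders")
    out = [rng.randrange(vocab_size) for _ in range(n)]
    # copy one chunk from x; slicing truncates at n automatically
    out[p : p + c] = x[p : p + c]
    return out
```

Note that Python slice assignment handles the "truncate if needed" case in step 5 for free: when `p + c` exceeds `n`, both slices shrink to the same length.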
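The n-gram filtering step quoted in the Dataset Splits row can be illustrated with a minimal sketch. The rule shown here (drop any training document sharing at least one token-level n-gram with a to-be-removed sequence) is an assumption for illustration; the paper's actual filtering rule may differ in detail.

```python
def ngrams(tokens, n):
    """All contiguous token n-grams of a sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_filter(documents, targets, n):
    """Hedged sketch of n-gram filtering: remove every training
    document that shares an n-gram with any target sequence in Dmem,
    yielding the filtered dataset D(n)_filter."""
    banned = set()
    for t in targets:
        banned |= ngrams(t, n)
    return [d for d in documents if not (ngrams(d, n) & banned)]
```

The paper's observation that removed sequences may still be completed via surviving m-grams (m < n) is visible here: a filter at n leaves all shorter shared subsequences in place.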