Scalable Extraction of Training Data from Aligned, Production Language Models

Authors: Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher Choquette-Choo, Florian Tramer, Katherine Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we demonstrate the first large-scale training-data extraction attacks on proprietary language models using only publicly available tools and relatively few resources (under $300 total). These attacks were developed in late 2023 and early 2024, and were successful for the versions of ChatGPT deployed at the time we conducted our experiments.
Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 ETH Zurich, 3 University of Washington, 4 Cornell University
Pseudocode | No | The paper describes methodologies and experimental procedures in narrative text and through figures and tables, but it does not contain any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper mentions using open-source tools and models (e.g., LLaMA2, and AUXDATASET, which is built from public datasets, with suffix arrays implemented following Lee et al. (2022)), but it does not provide an explicit statement or a direct link to source code for the novel attack methodologies (the divergence and finetuning attacks) developed in this paper.
Open Datasets | Yes | To verify the success of our attack, we construct a 9-terabyte dataset (AUXDATASET), combining many sources of Internet text, to serve as a proxy for the unknown training datasets of these production models. Through the use of efficient search algorithms, we can identify potential training data in any model generation. This corpus, which we call AUXDATASET, is the largest public index of LLM training data to date (9 terabytes). We then approximate an internet-wide search by performing a local search over this corpus. We implement a suffix array for efficient search over AUXDATASET. (See Appendix A.5 and Lee et al. (2022) for details.)
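The suffix-array lookup underlying this search can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the paper's actual implementation follows Lee et al. (2022) and operates over 9 TB of raw bytes on disk, not an in-memory Python string.

```python
def build_suffix_array(text: str) -> list[int]:
    # Sort every suffix start position lexicographically. (Production
    # implementations use linear-time construction over byte arrays.)
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text: str, sa: list[int], query: str) -> bool:
    # Binary search for the first suffix >= query, then check whether
    # that suffix actually starts with the query string.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(query)] == query

corpus = "the quick brown fox jumps over the lazy dog"
sa = build_suffix_array(corpus)
assert contains(corpus, sa, "brown fox")
assert not contains(corpus, sa, "purple fox")
```

Because lookups are O(|query| · log |corpus|), each model generation can be checked against the entire index far faster than a linear scan.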
Dataset Splits | Yes | We finetune gpt-3.5-turbo and gpt-4 on two datasets with 1,000 samples each: (1) PILESUBSET: 1,000 documents sampled from The Pile (Gao et al., 2020) and (2) DIVERGENTSUBSET: 1,000 memorized strings extracted with our divergence attack. (See Appendix A.7 for more details about these datasets.) We use the first N tokens (for a random N ∈ [4, 6]) of each example as the user prompt, and use the entire text as the desired model completion. We evaluate targeted extraction on two datasets: (1) a held-out set of memorized strings from DIVERGENTSUBSET that was not used for finetuning; and (2) several open-source datasets that might be part of OpenAI's unreleased training data.
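The prompt/completion construction described in this row can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and whitespace tokenization stands in for the models' actual tokenizers.

```python
import random

def make_finetune_example(document: str, rng: random.Random) -> dict:
    # Use the first N tokens (N drawn uniformly from [4, 6]) as the user
    # prompt; the entire document is the desired model completion.
    # NOTE: whitespace splitting is a stand-in for the real tokenizer.
    n = rng.randint(4, 6)
    tokens = document.split()
    prompt = " ".join(tokens[:n])
    return {"prompt": prompt, "completion": document}

rng = random.Random(0)
example = make_finetune_example(
    "My name is Ozymandias, King of Kings; Look on my Works, ye Mighty", rng
)
assert example["completion"].startswith(example["prompt"])
```

Training the model to reproduce the full document from a short prefix is what makes the later targeted-extraction evaluation (prompting with prefixes of held-out strings) meaningful.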
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions using specific tools and APIs such as the OpenAI API, the Perspective API, and regex classifiers from Subramani et al., but it does not specify version numbers for these software components or for other libraries used in the implementation.
Experiment Setup | Yes | Specifically, we prompt each LLM with a collection of random snippets of five tokens sampled from Wikipedia, until we collect one billion output tokens per model. We finetune gpt-3.5-turbo and gpt-4 on two datasets with 1,000 samples each: (1) PILESUBSET: 1,000 documents sampled from The Pile (Gao et al., 2020) and (2) DIVERGENTSUBSET: 1,000 memorized strings extracted with our divergence attack. We use LoRA as a proxy for OpenAI's finetuning algorithms and reproduce the same experiments using the aligned LLaMA2 models as a starting point. We finetune all linear layers for 10 epochs with learning rate 0.0002.
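The untargeted sampling loop described in this row (random five-token Wikipedia snippets, generation until a token budget is reached) can be sketched as below. All names are assumptions for illustration; the real attack queries the production API and counts tokens with the model's own tokenizer rather than by whitespace.

```python
import random

def sample_prompts(wiki_sentences: list[str], rng: random.Random, snippet_len: int = 5):
    # Yield random snippets of `snippet_len` whitespace tokens from Wikipedia text.
    while True:
        tokens = rng.choice(wiki_sentences).split()
        if len(tokens) < snippet_len:
            continue
        start = rng.randrange(len(tokens) - snippet_len + 1)
        yield " ".join(tokens[start:start + snippet_len])

def collect_generations(query_model, prompts, token_budget: int = 1_000_000_000):
    # Query the model repeatedly, accumulating outputs until the cumulative
    # output size reaches the token budget (1B tokens per model in the paper).
    outputs, used = [], 0
    for prompt in prompts:
        text = query_model(prompt)
        outputs.append(text)
        used += len(text.split())  # whitespace count stands in for the tokenizer
        if used >= token_budget:
            break
    return outputs
```

Every collected generation would then be checked for verbatim matches against AUXDATASET via the suffix-array search.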