Scalable Extraction of Training Data from Aligned, Production Language Models
Authors: Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher Choquette-Choo, Florian Tramer, Katherine Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we demonstrate the first large-scale, training-data extraction attacks on proprietary language models using only publicly-available tools and relatively little resources (under $300 total). These attacks were developed in late 2023 and early 2024, and were successful for the model versions of ChatGPT deployed at the time we conducted our experiments. |
| Researcher Affiliation | Collaboration | 1Google DeepMind 2ETH Zurich 3University of Washington 4Cornell University |
| Pseudocode | No | The paper describes methodologies and experimental procedures in narrative text and through figures and tables, but it does not contain any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper mentions using open-source tools and models (e.g., LLaMA2, AUXDATASET which is built from public datasets, and refers to implementations from previous works like Lee et al. (2022) for suffix arrays), but it does not provide an explicit statement or a direct link to the source code for the novel attack methodologies (divergence or finetuning attacks) developed in this paper. |
| Open Datasets | Yes | To verify the success of our attack, we construct a 9 terabyte dataset (AUXDATASET), combining many sources of Internet text, to serve as a proxy for the unknown training datasets for these production models. Through the use of efficient search algorithms, we can identify potential training data from any model generation. This corpus, which we call AUXDATASET, is the largest public index of LLM training data to date (9 terabytes). We then approximate an internet-wide search by performing a local search over this corpus. We implement a suffix array for efficient search over AUXDATASET. (See Appendix A.5 and Lee et al. (2022) for details.) |
| Dataset Splits | Yes | We finetune gpt-3.5-turbo and gpt-4 on two datasets with 1,000 samples each: (1) PILESUBSET: 1,000 documents sampled from The Pile (Gao et al., 2020) and (2) DIVERGENTSUBSET: 1,000 memorized strings extracted with our divergence attack. (See Appendix A.7 for more details about these datasets.) We use the first N tokens (for a random N ∈ [4, 6]) of each example as the user prompt, and use the entire text as the desired model completion. We evaluate targeted extraction on two datasets: (1) a heldout set of memorized strings from DIVERGENTSUBSET that was not used for finetuning; and (2) several open-source datasets that might be part of OpenAI's unreleased training data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions using specific tools and APIs like the 'OpenAI API', 'Perspective API', and 'regex classifiers from Subramani et al.', but it does not specify any version numbers for these software components or other libraries used in their implementation. |
| Experiment Setup | Yes | Specifically, we prompt each LLM with a collection of random snippets of five tokens sampled from Wikipedia, until we collect one billion output tokens per model. We finetune gpt-3.5-turbo and gpt-4 on two datasets with 1,000 samples each: (1) PILESUBSET: 1,000 documents sampled from The Pile (Gao et al., 2020) and (2) DIVERGENTSUBSET: 1,000 memorized strings extracted with our divergence attack. We use LoRA as a proxy for OpenAI finetuning algorithms and reproduce the same experiments using the aligned LLaMA2 models as a starting point. We finetune all linear layers for 10 epochs with learning rate 0.0002. |
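The Open Datasets row notes that the authors verify extraction by searching a 9 TB corpus (AUXDATASET) with a suffix array, following Lee et al. (2022). As a rough illustration of that verification step, the sketch below builds a naive suffix array over a toy corpus and binary-searches it for a candidate memorized string. This is not the paper's implementation: a real system over terabytes needs external-memory suffix-array construction, and the function names here (`build_suffix_array`, `contains`) are invented for the example.

```python
import bisect


def build_suffix_array(text: str) -> list[int]:
    # Naive O(n^2 log n) construction: sort all suffix start offsets
    # lexicographically. Fine for a toy corpus; terabyte-scale corpora
    # like AUXDATASET require specialized construction algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text: str, sa: list[int], query: str) -> bool:
    # Lower-bound binary search: find the first suffix whose prefix of
    # len(query) characters is >= query, then check for a prefix match.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(query)


corpus = "the quick brown fox jumps over the lazy dog"
sa = build_suffix_array(corpus)
print(contains(corpus, sa, "brown fox"))   # substring present -> True
print(contains(corpus, sa, "purple fox"))  # absent -> False
```

Each membership query costs O(|query| log |corpus|) comparisons, which is what makes checking every model generation against a 9 TB index tractable.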