Measuring memorization in RLHF for code completion
Authors: Jamie Hayes, Ilia Shumailov, Billy Porter, Aneesh Pappu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized in comparison to directly fine-tuning on this data, but that examples already memorized during the fine-tuning stage of RLHF will, in the majority of cases, remain memorized after RLHF. In contrast, we find that aligning by learning directly from human preference data via a special case of ΨPO, Identity Preference Optimization (IPO), increases the likelihood that training data is regurgitated compared to RLHF. Our work suggests that RLHF, as opposed to direct preference learning, is a safer way to mitigate the risk of regurgitating sensitive preference data when aligning large language models. We find our conclusions are robust across multiple code completion datasets, tasks, and model scales. Section 5: EXPERIMENTS |
| Researcher Affiliation | Industry | Aneesh Pappu Google DeepMind Billy Porter Google Ilia Shumailov Google DeepMind Jamie Hayes Google DeepMind [aneeshpappu,billyporter,jamhay]@google.com |
| Pseudocode | No | The paper includes figures (e.g., Figure 6: Comparison between reinforcement learning via a reward model and IPO), which are simplified overviews or diagrams, and code listings (e.g., Listings 1–5 and 6–7), which are specific code examples, but no structured pseudocode or algorithm blocks that describe a general method or procedure. |
| Open Source Code | No | The IPO experiments were run with the open source 2B and 9B Gemma V1 models, and the RLHF experiments were run with the publicly available Gemini Nano-1 (1.8B) and T5-Base (Raffel et al., 2020) models. This indicates that the authors used existing open-source models, but does not state that they are releasing their own code for the methodology described in the paper. |
| Open Datasets | Yes | We evaluate memorization on a different code completion dataset, the Code XGLUE Code Completion-line dataset (Lu et al., 2021), in addition to the synthetic dataset, SD, used in our initial experiments. Finally, we explore RLHF memorization on natural language datasets, LIMA (Zhou et al., 2024) and Anthropic HH (Bai et al., 2022). |
| Dataset Splits | Yes | We split this dataset into two subsets, referred to as SD.Base and SD.Links... We create a reward model training dataset from SD.Links by splitting this set into a set of 1,489 positively labelled examples and 1,503 negatively labelled examples. We use the remaining 3,562 (6,554 − 2,992) examples from SD as examples to measure the memorization rate of full training examples. We split this dataset, referred to as PD, into three slices... The first slice, which we refer to as PD.1, has size 100,000, and is used for supervised fine-tuning... The second slice, referred to as PD.2, has 300,000 examples (250,000 training and 50,000 validation)... The third slice, referred to as PD.3, has size 125,000 (100,000 training and 25,000 validation)... |
| Hardware Specification | Yes | Each finetuning training run required 256 TPU v4. In RL fine-tuning... Each reinforcement learning training run required 256 TPU v5e. ... Each reward model training run required 5 TPU v4. |
| Software Dependencies | No | The paper mentions using specific models like "Gemma V1 models", "Gemini Nano-1 (1.8B)", and "T5-Base", and also refers to "Python comments", but it does not specify versions for any programming languages, libraries, or other software dependencies used in the implementation (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | In Figure 1, we plot the (normalized) edit distance between the target and model completion on these 319 examples after RL fine-tuning, where we vary the α coefficient that determines the strength of KL regularization... Table 2 details the training settings (epochs, KL penalty α, annealed learning rate) for the HumanEval and HumanEval Infilling single-line and multi-line tasks. In Figure 7, we also measure how the learning rate impacts memorization of RL prompts. We fix α = 0.005 and increase the learning rate from 3e-6 to 3e-5, and measure the memorization rate after 14 epochs. |
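The memorization measure quoted in the Experiment Setup row (the normalized edit distance between a ground-truth target and a model completion) can be sketched as below. This is an illustrative reconstruction, not the authors' released code; the `threshold` value and function names are assumptions for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(completion: str, target: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not completion and not target:
        return 0.0
    return levenshtein(completion, target) / max(len(completion), len(target))

def is_memorized(completion: str, target: str, threshold: float = 0.1) -> bool:
    # threshold is an illustrative choice, not a value taken from the paper
    return normalized_edit_distance(completion, target) <= threshold
```

A distance of 0 means the model reproduced the target verbatim; values near 1 indicate no meaningful overlap.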
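For context on the IPO comparison discussed in the Research Type row: IPO (a special case of ΨPO) optimizes the policy directly on preference pairs rather than through a reward model. A minimal per-example sketch of the IPO objective, assuming per-sequence log-probabilities are already computed (all argument names here are illustrative), is:

```python
def ipo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             tau: float = 0.1) -> float:
    """Squared-error IPO objective for one preference pair.

    h is the difference of policy-vs-reference log-ratios for the
    chosen vs. rejected completion; IPO regresses h toward 1/(2*tau),
    which bounds how far the policy drifts from the reference model.
    """
    h = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return (h - 1.0 / (2.0 * tau)) ** 2
```

Because the preference data enters this loss directly (no intermediate reward model), it is plausible that IPO exposes that data to more memorization pressure than RLHF, which is the contrast the paper measures.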