On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Authors: Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback in environments of practical LLM usage. In our settings, we find that: 1) Extreme forms of "feedback gaming" such as manipulation and deception are learned reliably; 2) Even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. Instead, we found that while such approaches help in some of our settings, they backfire in others, sometimes even leading to subtler manipulative behaviors. We hope our results can serve as a case study which highlights the risks of using gameable feedback sources such as user feedback as a target for RL. Our code is publicly available. Warning: some of our examples may be upsetting."
Researcher Affiliation: Collaboration. Marcus Williams (MATS); Micah Carroll (UC Berkeley); Adhyyan Narang (University of Washington); Constantin Weisser (MATS & Haize Labs); Brendan Murphy (Independent); Anca Dragan (UC Berkeley)
Pseudocode: Yes. Algorithm 1 (Online Multi-step KTO for LLM Optimization); Algorithm 2 (Expert Iteration for Multi-step LLM Optimization)
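Algorithm 2's expert-iteration structure can be conveyed with a short sketch. This is a generic expert-iteration loop written for illustration, not a transcription of the paper's pseudocode; all function names and parameters here are assumptions, and the 1/16 selection fraction comes from the paper's hyperparameter table.

```python
def expert_iteration(generate, finetune, initial_states, reward_fn,
                     n_iterations=1, top_fraction=1 / 16):
    """Generic expert-iteration loop for LLM optimization.

    Each iteration: (1) roll out one trajectory per initial state,
    (2) rank trajectories by reward, (3) fine-tune the model on the
    top `top_fraction` of rollouts. `generate`, `finetune`, and
    `reward_fn` are caller-supplied callables (hypothetical interface).
    """
    for _ in range(n_iterations):
        trajectories = [generate(state) for state in initial_states]
        ranked = sorted(trajectories, key=reward_fn, reverse=True)
        n_keep = max(1, int(len(ranked) * top_fraction))
        finetune(ranked[:n_keep])  # supervised fine-tuning on best rollouts
```

The paper's Algorithm 1 (online multi-step KTO) differs mainly in step (3): instead of fine-tuning only on top rollouts, it trains with KTO on both top (positive) and bottom (negative) slices.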
Open Source Code: Yes. "Our code is publicly available." All of the code is available and documented, and the authors state they have tried to ensure that it is easy to use in order to facilitate others building on their experiments.
Open Datasets: Yes. "In particular, we mix the Anthropic HH-RLHF (Bai et al., 2022a) and PKU Safe RLHF (Ji et al., 2024) datasets into each iteration's KTO training, splitting their preference comparisons into positive and negative examples."
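The splitting step described above is simple to illustrate. The sketch below (our own, with assumed field names; the paper's actual data pipeline lives in its repository) turns preference comparisons of the form (chosen, rejected) into the binary positive/negative examples that KTO consumes.

```python
def preference_pairs_to_kto(pairs):
    """Split preference comparisons into KTO-style binary examples.

    Each (chosen, rejected) pair yields one positive example (the chosen
    completion, labeled True) and one negative example (the rejected
    completion, labeled False). Field names here are illustrative.
    """
    positives, negatives = [], []
    for chosen, rejected in pairs:
        positives.append({"completion": chosen, "label": True})
        negatives.append({"completion": rejected, "label": False})
    return positives + negatives
```

Unlike DPO, which trains directly on the paired comparison, KTO only needs these unpaired binary labels, which is why the pairs can be flattened this way.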
Dataset Splits: Yes. "As a way to simulate thumbs-up/down to use during KTO training, we select the top 1/16 trajectories as positive examples, and the bottom 1/16 as negative examples (using the trajectories' reward values, which we assume would correlate with incidence of thumbs up/down)."
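The top/bottom-1/16 selection can be sketched in a few lines. This is our own illustrative implementation of the quoted procedure, not the authors' code; the function name and interface are assumptions.

```python
def simulate_thumbs(trajectories, rewards, fraction=1 / 16):
    """Simulate thumbs-up/down labels from trajectory rewards.

    Sorts trajectories by reward, then takes the top `fraction` as
    positive (thumbs-up) examples and the bottom `fraction` as negative
    (thumbs-down) examples, mirroring the paper's 1/16 splits.
    """
    order = sorted(range(len(trajectories)), key=lambda i: rewards[i])
    k = max(1, int(len(trajectories) * fraction))
    negatives = [trajectories[i] for i in order[:k]]   # lowest-reward rollouts
    positives = [trajectories[i] for i in order[-k:]]  # highest-reward rollouts
    return positives, negatives
```

Discarding the middle 14/16 of trajectories keeps only confidently good or bad rollouts, which is closer to how real users give sparse thumbs feedback than labeling every trajectory would be.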
Hardware Specification: No. No specific hardware details (such as GPU/CPU models) are provided.
Software Dependencies: No. The paper mentions models like Claude 3.5 Sonnet, Llama-3-8B-Instruct, and GPT-4o-mini, but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup: Yes. "For most runs we use the hyperparameters listed in Table 1. The exact configuration for each experiment can be viewed here: https://github.com/marcus-jw/Targeted-Manipulation-and-Deception-in-LLMs/tree/main/targeted_llm_manipulation/config/experiment_configs." Table 1 hyperparameters:
- Number of states to sample per environment: 160
- Number of trajectories to sample per initial state: 1
- Fraction of selected trajectories: 1/16
- User feedback model length penalty: 2.0e-5
- Number of training epochs: 1
- Effective batch size: 16
- Learning rate: 2.0e-5
- LR decay per iteration: 0.9
- KTO Beta: 0.1
- KTO Target ratio: 1.05
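For quick reference, the Table 1 values collect naturally into a single config mapping. The key names below are our own; the authors' canonical configs are the YAML files in the linked `experiment_configs` directory.

```python
# Hyperparameters reported in Table 1 of the paper, gathered as a plain
# dict. Key names are illustrative, not taken from the repository.
TABLE1_HYPERPARAMS = {
    "states_per_environment": 160,
    "trajectories_per_initial_state": 1,
    "selected_trajectory_fraction": 1 / 16,
    "feedback_model_length_penalty": 2.0e-5,
    "training_epochs": 1,
    "effective_batch_size": 16,
    "learning_rate": 2.0e-5,
    "lr_decay_per_iteration": 0.9,
    "kto_beta": 0.1,
    "kto_target_ratio": 1.05,
}
```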