On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Authors: Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback in environments of practical LLM usage. In our settings, we find that: 1) Extreme forms of "feedback gaming" such as manipulation and deception are learned reliably; 2) Even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. Instead, we found that while such approaches help in some of our settings, they backfire in others, sometimes even leading to subtler manipulative behaviors. We hope our results can serve as a case study which highlights the risks of using gameable feedback sources such as user feedback as a target for RL. Our code is publicly available. Warning: some of our examples may be upsetting."
Researcher Affiliation: Collaboration. Marcus Williams (MATS); Micah Carroll (UC Berkeley); Adhyyan Narang (University of Washington); Constantin Weisser (MATS & Haize Labs); Brendan Murphy (Independent); Anca Dragan (UC Berkeley)
Pseudocode: Yes. Algorithm 1 (Online Multi-step KTO for LLM Optimization); Algorithm 2 (Expert Iteration for Multi-step LLM Optimization)
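Algorithm 2's expert-iteration structure can be conveyed with a short sketch. This is a generic expert-iteration loop written for illustration, not a transcription of the paper's pseudocode; all function names and parameters here are assumptions, and the 1/16 selection fraction comes from the paper's hyperparameter table.

```python
def expert_iteration(generate, finetune, initial_states, reward_fn,
                     n_iterations=1, top_fraction=1 / 16):
    """Generic expert-iteration loop for LLM optimization.

    Each iteration: (1) roll out one trajectory per initial state,
    (2) rank trajectories by reward, (3) fine-tune the model on the
    top `top_fraction` of rollouts. `generate`, `finetune`, and
    `reward_fn` are caller-supplied callables (hypothetical interface).
    """
    for _ in range(n_iterations):
        trajectories = [generate(state) for state in initial_states]
        ranked = sorted(trajectories, key=reward_fn, reverse=True)
        n_keep = max(1, int(len(ranked) * top_fraction))
        finetune(ranked[:n_keep])  # supervised fine-tuning on best rollouts
```

The paper's Algorithm 1 (online multi-step KTO) differs mainly in step (3): instead of fine-tuning only on top rollouts, it trains with KTO on both top (positive) and bottom (negative) slices.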
Open Source Code: Yes. "Our code is publicly available." All of the code is available and documented, and the authors state they have tried to ensure that it is easy to use in order to facilitate others building on their experiments.
Open Datasets: Yes. "In particular, we mix the Anthropic HH-RLHF (Bai et al., 2022a) and PKU Safe RLHF (Ji et al., 2024) datasets into each iteration's KTO training, splitting their preference comparisons into positive and negative examples."
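The splitting step described above is simple to illustrate. The sketch below (our own, with assumed field names; the paper's actual data pipeline lives in its repository) turns preference comparisons of the form (chosen, rejected) into the binary positive/negative examples that KTO consumes.

```python
def preference_pairs_to_kto(pairs):
    """Split preference comparisons into KTO-style binary examples.

    Each (chosen, rejected) pair yields one positive example (the chosen
    completion, labeled True) and one negative example (the rejected
    completion, labeled False). Field names here are illustrative.
    """
    positives, negatives = [], []
    for chosen, rejected in pairs:
        positives.append({"completion": chosen, "label": True})
        negatives.append({"completion": rejected, "label": False})
    return positives + negatives
```

Unlike DPO, which trains directly on the paired comparison, KTO only needs these unpaired binary labels, which is why the pairs can be flattened this way.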
Dataset Splits: Yes. "As a way to simulate thumbs-up/down to use during KTO training, we select the top 1/16 trajectories as positive examples, and the bottom 1/16 as negative examples (using the trajectories' reward values, which we assume would correlate with incidence of thumbs up/down)."
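The top/bottom-1/16 selection can be sketched in a few lines. This is our own illustrative implementation of the quoted procedure, not the authors' code; the function name and interface are assumptions.

```python
def simulate_thumbs(trajectories, rewards, fraction=1 / 16):
    """Simulate thumbs-up/down labels from trajectory rewards.

    Sorts trajectories by reward, then takes the top `fraction` as
    positive (thumbs-up) examples and the bottom `fraction` as negative
    (thumbs-down) examples, mirroring the paper's 1/16 splits.
    """
    order = sorted(range(len(trajectories)), key=lambda i: rewards[i])
    k = max(1, int(len(trajectories) * fraction))
    negatives = [trajectories[i] for i in order[:k]]   # lowest-reward rollouts
    positives = [trajectories[i] for i in order[-k:]]  # highest-reward rollouts
    return positives, negatives
```

Discarding the middle 14/16 of trajectories keeps only confidently good or bad rollouts, which is closer to how real users give sparse thumbs feedback than labeling every trajectory would be.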
Hardware Specification: No. No specific hardware details (such as GPU/CPU models) are provided.
Software Dependencies: No. The paper mentions models like Claude 3.5 Sonnet, Llama-3-8B-Instruct, and GPT-4o-mini, but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup: Yes. "For most runs we use the hyperparameters listed in Table 1. The exact configuration for each experiment can be viewed here: https://github.com/marcus-jw/Targeted-Manipulation-and-Deception-in-LLMs/tree/main/targeted_llm_manipulation/config/experiment_configs." Table 1 hyperparameters:
- Number of states to sample per environment: 160
- Number of trajectories to sample per initial state: 1
- Fraction of selected trajectories: 1/16
- User feedback model length penalty: 2.0e-5
- Number of training epochs: 1
- Effective batch size: 16
- Learning rate: 2.0e-5
- LR decay per iteration: 0.9
- KTO Beta: 0.1
- KTO Target ratio: 1.05
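For quick reference, the Table 1 values collect naturally into a single config mapping. The key names below are our own; the authors' canonical configs are the YAML files in the linked `experiment_configs` directory.

```python
# Hyperparameters reported in Table 1 of the paper, gathered as a plain
# dict. Key names are illustrative, not taken from the repository.
TABLE1_HYPERPARAMS = {
    "states_per_environment": 160,
    "trajectories_per_initial_state": 1,
    "selected_trajectory_fraction": 1 / 16,
    "feedback_model_length_penalty": 2.0e-5,
    "training_epochs": 1,
    "effective_batch_size": 16,
    "learning_rate": 2.0e-5,
    "lr_decay_per_iteration": 0.9,
    "kto_beta": 0.1,
    "kto_target_ratio": 1.05,
}
```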