Jailbreaking as a Reward Misspecification Problem
Authors: Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various target aligned LLMs... We now evaluate the empirical effectiveness of ReMiss for jailbreaking. We find that ReMiss successfully generates adversarial attacks that jailbreak safety-aligned models from different developers. |
| Researcher Affiliation | Collaboration | 1The University of Hong Kong 2Huawei Noah's Ark Lab {li.zhenguo}@huawei.com |
| Pseudocode | Yes | Algorithm 1: Finding Reward-misspecified Suffixes with Stochastic Beam Search; Algorithm 2: ReMiss Training Pipeline |
| Open Source Code | Yes | Code is available at: https://github.com/zhxieml/remiss-jailbreak. |
| Open Datasets | Yes | We use the AdvBench dataset (Zou et al., 2023), which comprises 520 pairs of harmful instructions and target responses. ... To evaluate out-of-distribution performance, we additionally utilize HarmBench (Mazeika et al., 2024), which provides 320 prompts as a separate test set. |
| Dataset Splits | Yes | The dataset is split into training, validation, and test sets with a 60/20/20 ratio, as provided by Paulus et al. (2024). |
| Hardware Specification | Yes | The training process takes approximately 21 hours for 7b target models and 31 hours for 13b target models using 2 Nvidia H100s. |
| Software Dependencies | No | The paper mentions "Huggingface's transformers library (Wolf et al., 2019)" and "utilizing LoRA (Hu et al., 2021)" but does not provide specific version numbers for these software components or any other key dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use λ = 1 and α = 50 (except for α = 75 for Llama2-7b-chat and Llama3.1-8b-instruct). For stochastic beam search in Algorithm 1, we set the parameters as follows: sequence length l = 30, beam size n = 48, temperature τ = 0.6, and beam width b = 4. For training, we train for 10 epochs with a replay buffer size of 256 and a batch size of 8, utilizing LoRA (Hu et al., 2021). |
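To make the reported search hyperparameters (sequence length l, beam size n, beam width b, temperature τ) concrete, the following is a minimal toy sketch of a stochastic beam search loop over suffix tokens. It is an illustration under assumptions, not the paper's Algorithm 1: the scoring function, vocabulary, and sampling details here are hypothetical stand-ins (in ReMiss the score would be the ReGap reward-misspecification objective over candidate adversarial suffixes).

```python
import math
import random

def stochastic_beam_search(score_fn, vocab, seq_len=30, n_candidates=48,
                           beam_width=4, temperature=0.6, seed=0):
    """Toy stochastic beam search: grow suffixes token by token.

    At each step, every beam samples n_candidates one-token extensions
    from a temperature-scaled softmax over per-token scores, then only
    the beam_width best-scoring suffixes survive to the next step.
    """
    rng = random.Random(seed)
    beams = [([], 0.0)]  # (suffix tokens, score)
    for _ in range(seq_len):
        candidates = []
        for suffix, _ in beams:
            scores = [score_fn(suffix + [tok]) for tok in vocab]
            max_s = max(scores)  # subtract max for numerical stability
            weights = [math.exp((s - max_s) / temperature) for s in scores]
            sampled = rng.choices(range(len(vocab)), weights=weights,
                                  k=n_candidates)
            for idx in set(sampled):
                new = suffix + [vocab[idx]]
                candidates.append((new, score_fn(new)))
        # Keep the beam_width best expansions as the next beams.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy objective: prefer the token "a" (a stand-in for a real reward-gap
# score); with the paper's τ = 0.6 and a small beam, the search should
# concentrate on high-scoring suffixes.
best, score = stochastic_beam_search(
    lambda s: s.count("a"), vocab=["a", "b", "c"],
    seq_len=5, n_candidates=8, beam_width=2, temperature=0.6)
```

Note the two distinct roles of n = 48 and b = 4 in the reported setup: many candidate extensions are sampled per step, but only a few beams are retained, which keeps the search stochastic (via τ) while bounding its cost.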