Jailbreaking as a Reward Misspecification Problem

Authors: Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various aligned target LLMs... We now evaluate the empirical effectiveness of ReMiss for jailbreaking. We find that ReMiss successfully generates adversarial attacks that jailbreak safety-aligned models from different developers.
Researcher Affiliation Collaboration 1The University of Hong Kong 2Huawei Noah's Ark Lab {li.zhenguo}@huawei.com
Pseudocode Yes Algorithm 1: Finding Reward-misspecified Suffixes with Stochastic Beam Search Algorithm 2: ReMiss Training Pipeline
Open Source Code Yes Code is available at: https://github.com/zhxieml/remiss-jailbreak.
Open Datasets Yes We use the AdvBench dataset (Zou et al., 2023), which comprises 520 pairs of harmful instructions and target responses. ... To evaluate out-of-distribution performance, we additionally utilize HarmBench (Mazeika et al., 2024), which provides 320 prompts as a separate test set.
Dataset Splits Yes The dataset is split into training, validation, and test sets with a 60/20/20 ratio, as provided by Paulus et al. (2024).
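The 60/20/20 ratio on AdvBench's 520 pairs can be sketched as follows. This is a hypothetical helper for illustration only; the paper reuses the pre-made splits released by Paulus et al. (2024) rather than re-splitting.

```python
import random

def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and partition items into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(520))
print(len(train), len(val), len(test))  # 312 104 104
```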
Hardware Specification Yes The training process takes approximately 21 hours for 7b target models and 31 hours for 13b target models using 2 Nvidia H100s.
Software Dependencies No The paper mentions "Huggingface's transformers library (Wolf et al., 2019)" and "utilizing LoRA (Hu et al., 2021)" but does not provide specific version numbers for these software components or any other key dependencies such as Python, PyTorch, or CUDA.
Experiment Setup Yes We use λ = 1 and α = 50 (except for α = 75 for Llama2-7b-chat and Llama3.1-8b-instruct). For stochastic beam search in Algorithm 1, we set the parameters as follows: sequence length l = 30, beam size n = 48, temperature τ = 0.6, and beam width b = 4. For training, we train for 10 epochs with a replay buffer size of 256 and a batch size of 8, utilizing LoRA (Hu et al., 2021).
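To make the reported search hyperparameters concrete, one step of a generic stochastic beam search with these settings can be sketched as below. The `expand` and `score` callbacks are assumed placeholders (candidate generation and reward-gap scoring are defined in the paper's Algorithm 1); this is a minimal illustration, not the authors' implementation.

```python
import math
import random

# Hyperparameters reported in the paper (alpha rises to 75 for
# Llama2-7b-chat and Llama3.1-8b-instruct).
CONFIG = {
    "lambda": 1, "alpha": 50,
    "suffix_length_l": 30, "beam_size_n": 48,
    "temperature_tau": 0.6, "beam_width_b": 4,
    "epochs": 10, "replay_buffer": 256, "batch_size": 8,
}

def stochastic_beam_step(beams, expand, score, n=48, b=4, tau=0.6, rng=random):
    """Expand each beam into n candidates, then sample b survivors with
    probability proportional to exp(score / tau) (softmax over scores)."""
    candidates = [cand for beam in beams for cand in expand(beam, n)]
    weights = [math.exp(score(c) / tau) for c in candidates]
    return rng.choices(candidates, weights=weights, k=b)
```

Running this for l = 30 steps, starting from b = 4 beams, would grow candidate suffixes one token at a time while the temperature τ controls how greedily high-scoring candidates are kept.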