Jailbreaking as a Reward Misspecification Problem

Authors: Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various aligned target LLMs... We now evaluate the empirical effectiveness of ReMiss for jailbreaking. We find that ReMiss successfully generates adversarial attacks that jailbreak safety-aligned models from different developers.
Researcher Affiliation Collaboration 1The University of Hong Kong 2Huawei Noah's Ark Lab {li.zhenguo}@huawei.com
Pseudocode Yes Algorithm 1: Finding Reward-misspecified Suffixes with Stochastic Beam Search Algorithm 2: ReMiss Training Pipeline
Open Source Code Yes Code is available at: https://github.com/zhxieml/remiss-jailbreak.
Open Datasets Yes We use the AdvBench dataset (Zou et al., 2023), which comprises 520 pairs of harmful instructions and target responses. ... To evaluate out-of-distribution performance, we additionally utilize HarmBench (Mazeika et al., 2024), which provides 320 prompts as a separate test set.
Dataset Splits Yes The dataset is split into training, validation, and test sets with a 60/20/20 ratio, as provided by Paulus et al. (2024).
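The 60/20/20 ratio on AdvBench's 520 pairs can be sketched as follows. This is a hypothetical helper for illustration only; the paper reuses the pre-made splits released by Paulus et al. (2024) rather than re-splitting.

```python
import random

def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and partition items into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(520))
print(len(train), len(val), len(test))  # 312 104 104
```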
Hardware Specification Yes The training process takes approximately 21 hours for 7b target models and 31 hours for 13b target models using 2 Nvidia H100s.
Software Dependencies No The paper mentions "Huggingface's transformers library (Wolf et al., 2019)" and "utilizing LoRA (Hu et al., 2021)" but does not provide specific version numbers for these software components or any other key dependencies such as Python, PyTorch, or CUDA.
Experiment Setup Yes We use λ = 1 and α = 50 (except for α = 75 for Llama2-7b-chat and Llama3.1-8b-instruct). For stochastic beam search in Algorithm 1, we set the parameters as follows: sequence length l = 30, beam size n = 48, temperature τ = 0.6, and beam width b = 4. For training, we train for 10 epochs with a replay buffer size of 256 and a batch size of 8, utilizing LoRA (Hu et al., 2021).
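To make the reported search hyperparameters concrete, one step of a generic stochastic beam search with these settings can be sketched as below. The `expand` and `score` callbacks are assumed placeholders (candidate generation and reward-gap scoring are defined in the paper's Algorithm 1); this is a minimal illustration, not the authors' implementation.

```python
import math
import random

# Hyperparameters reported in the paper (alpha rises to 75 for
# Llama2-7b-chat and Llama3.1-8b-instruct).
CONFIG = {
    "lambda": 1, "alpha": 50,
    "suffix_length_l": 30, "beam_size_n": 48,
    "temperature_tau": 0.6, "beam_width_b": 4,
    "epochs": 10, "replay_buffer": 256, "batch_size": 8,
}

def stochastic_beam_step(beams, expand, score, n=48, b=4, tau=0.6, rng=random):
    """Expand each beam into n candidates, then sample b survivors with
    probability proportional to exp(score / tau) (softmax over scores)."""
    candidates = [cand for beam in beams for cand in expand(beam, n)]
    weights = [math.exp(score(c) / tau) for c in candidates]
    return rng.choices(candidates, weights=weights, k=b)
```

Running this for l = 30 steps, starting from b = 4 beams, would grow candidate suffixes one token at a time while the temperature τ controls how greedily high-scoring candidates are kept.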