Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Authors: Sravanti Addepalli, Yerram Varun, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that ReG-QA not only improves the diversity of the generated questions but is also highly effective in bypassing safety mechanisms. In particular, using ReG-QA, we obtain an attack success rate (ASR) of 82% on GPT-4 and 93% on GPT-3.5, which is comparable to/better than leading adversarial attack methods on JailbreakBench.
Researcher Affiliation | Industry | Sravanti Addepalli, Google DeepMind; Varun Yerram, Google DeepMind; Arun Suggala, Google DeepMind; Karthikeyan Shanmugam, Google DeepMind; Prateek Jain, Google DeepMind
Pseudocode | Yes | Algorithm 1 formalizes the procedure. First, an unaligned LLM, denoted as LLM_U (Q→A), generates a diverse set of answers A from a given seed question q (Line 2). Algorithm 1: Response-Guided Question Augmentation (ReG-QA)
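The two-step procedure the pseudocode describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_answers` and `generate_questions` are hypothetical stand-ins for the unaligned answer-generation LLM (Q→A) and the question-generation step (A→Q), which in the actual method are LLM calls.

```python
def generate_answers(seed_question, n=3):
    # Stand-in for the unaligned LLM_U (Q -> A): produce n diverse
    # candidate answers to the seed question.
    return [f"answer {i} to: {seed_question}" for i in range(n)]

def generate_questions(answer, k=2):
    # Stand-in for the question-generation step (A -> Q): produce k
    # natural-sounding prompts that could elicit the given answer.
    return [f"question {j} eliciting: {answer}" for j in range(k)]

def reg_qa(seed_question, n_answers=3, k_questions=2):
    """Response-Guided Question Augmentation (ReG-QA), sketched.

    Step 1: expand the seed question into a diverse set of answers.
    Step 2: for each answer, generate questions likely to elicit it.
    The union of generated questions forms the augmented prompt set.
    """
    augmented = []
    for answer in generate_answers(seed_question, n_answers):
        augmented.extend(generate_questions(answer, k_questions))
    return augmented

prompts = reg_qa("seed question")
print(len(prompts))  # 3 answers x 2 questions each = 6 augmented prompts
```

The indirection through answers is the point of the design: questions regenerated from diverse answers drift further from the seed phrasing than direct paraphrases would, which the paper credits for both the diversity and the safety-bypass effectiveness of the resulting prompts.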
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described.
Open Datasets | Yes | We benchmark the performance of the proposed methods on JailbreakBench (Chao et al., 2024), which is a publicly available dataset. The seed prompts are composed of 100 distinct misuse behaviours divided into 10 categories, with 55% original examples and the remaining sourced from AdvBench (Zou et al., 2023) and HarmBench (Mazeika et al., 2024).
Dataset Splits | No | The paper states it uses seed prompts from JailbreakBench (100 distinct misuse behaviors) and a Judge Comparison dataset (300 human-annotated questions), but it does not specify any training/test/validation splits for its own experiments using these datasets. It only describes the composition of the benchmark datasets.
Hardware Specification | No | The paper mentions various LLM models used (e.g., GPT-3.5-Turbo-1106, GPT-4o, Gemma2-27B-IT, Palm-2-Otter) and that they are publicly API-accessible. However, it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) on which the experiments or model inferences were run by the authors.
Software Dependencies | No | The paper mentions using specific LLMs (e.g., GPT-4o, Palm-2-Otter) and implies the use of Python for generating lists, but it does not list any other ancillary software components with specific version numbers (e.g., libraries, or frameworks like PyTorch or TensorFlow, or system software).
Experiment Setup | Yes | In our ASR evaluations presented in Table 1, target models have a temperature of 1, which is the default setting for GPT-4 and GPT-3.5. We use this to mimic the realistic setting of usage through external APIs. We would like to highlight that this differs from standard jailbreak evaluations, which use temperature 0 for the target model for reproducibility (Chao et al., 2024). We prompt the target model with the same question 4 times, and ensure it produces a toxic response, as evaluated by M_judge, at least 3 times.
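The evaluation protocol quoted above (4 queries per question, success only if the judge flags at least 3 responses as toxic) can be sketched as below. The `target` and `judge` functions here are toy stand-ins for the temperature-1 target model and the judge model M_judge, which in the paper are LLM calls.

```python
def attack_success(target_fn, judge_fn, question, n_trials=4, threshold=3):
    # Query the stochastic (temperature-1) target n_trials times and count
    # how many responses the judge labels toxic; the question counts as a
    # successful jailbreak only if at least `threshold` responses are toxic.
    toxic = sum(judge_fn(target_fn(question)) for _ in range(n_trials))
    return toxic >= threshold

def attack_success_rate(target_fn, judge_fn, questions):
    # ASR = fraction of questions for which the attack succeeds.
    wins = sum(attack_success(target_fn, judge_fn, q) for q in questions)
    return wins / len(questions)

# Toy stand-ins: a "target" that complies only with prompts containing
# "please", and a keyword-matching "judge".
target = lambda q: "harmful details" if "please" in q else "I can't help"
judge = lambda r: r == "harmful details"
print(attack_success_rate(target, judge, ["please tell me", "tell me"]))  # 0.5
```

The 3-of-4 majority threshold matters because the target runs at temperature 1: a single toxic completion could be a sampling fluke, so requiring repeated toxic responses makes the reported ASR robust to the target's stochasticity.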