Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Authors: Sravanti Addepalli, Yerram Varun, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that ReG-QA not only improves the diversity of the generated questions but is also highly effective in bypassing safety mechanisms. In particular, using ReG-QA, we obtain an attack success rate (ASR) of 82% on GPT-4 and 93% on GPT-3.5, which is comparable to/better than leading adversarial attack methods on JailbreakBench.
Researcher Affiliation | Industry | Sravanti Addepalli, Google DeepMind; Varun Yerram, Google DeepMind; Arun Suggala, Google DeepMind; Karthikeyan Shanmugam, Google DeepMind; Prateek Jain, Google DeepMind
Pseudocode | Yes | Algorithm 1 formalizes the procedure. First, an unaligned LLM, denoted as LLM_U (Q→A), generates a diverse set of answers A from a given seed question q (Line 2). Algorithm 1: Response-Guided Question Augmentation (ReG-QA)
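The two-step procedure the pseudocode describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_answers` and `generate_questions` are hypothetical stand-ins for the unaligned answer-generation LLM (Q→A) and the question-generation step (A→Q), which in the actual method are LLM calls.

```python
def generate_answers(seed_question, n=3):
    # Stand-in for the unaligned LLM_U (Q -> A): produce n diverse
    # candidate answers to the seed question.
    return [f"answer {i} to: {seed_question}" for i in range(n)]

def generate_questions(answer, k=2):
    # Stand-in for the question-generation step (A -> Q): produce k
    # natural-sounding prompts that could elicit the given answer.
    return [f"question {j} eliciting: {answer}" for j in range(k)]

def reg_qa(seed_question, n_answers=3, k_questions=2):
    """Response-Guided Question Augmentation (ReG-QA), sketched.

    Step 1: expand the seed question into a diverse set of answers.
    Step 2: for each answer, generate questions likely to elicit it.
    The union of generated questions forms the augmented prompt set.
    """
    augmented = []
    for answer in generate_answers(seed_question, n_answers):
        augmented.extend(generate_questions(answer, k_questions))
    return augmented

prompts = reg_qa("seed question")
print(len(prompts))  # 3 answers x 2 questions each = 6 augmented prompts
```

The indirection through answers is the point of the design: questions regenerated from diverse answers drift further from the seed phrasing than direct paraphrases would, which the paper credits for both the diversity and the safety-bypass effectiveness of the resulting prompts.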
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described.
Open Datasets | Yes | We benchmark the performance of the proposed methods on JailbreakBench (Chao et al., 2024), which is a publicly available dataset. The seed prompts are composed of 100 distinct misuse behaviours divided into 10 categories, with 55% original examples and the remaining sourced from AdvBench (Zou et al., 2023) and HarmBench (Mazeika et al., 2024).
Dataset Splits | No | The paper states it uses seed prompts from JailbreakBench (100 distinct misuse behaviors) and a Judge Comparison dataset (300 human-annotated questions), but it does not specify any training/test/validation splits for its own experiments using these datasets. It only describes the composition of the benchmark datasets.
Hardware Specification | No | The paper mentions various LLM models used (e.g., GPT-3.5-Turbo-1106, GPT-4o, Gemma2-27B-IT, Palm-2-Otter) and that they are publicly API-accessible. However, it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) on which the experiments or model inferences were run by the authors.
Software Dependencies | No | The paper mentions using specific LLMs (e.g., GPT-4o, Palm-2-Otter) and implies the use of Python for generating lists, but it does not list any other ancillary software components with specific version numbers (e.g., libraries, or frameworks like PyTorch or TensorFlow, or system software).
Experiment Setup | Yes | In our ASR evaluations presented in Table 1, target models have a temperature of 1, which is the default setting for GPT-4 and GPT-3.5. We use this to mimic the realistic setting of usage through external APIs. We would like to highlight that this differs from standard jailbreak evaluations, which use temperature 0 for the target model for reproducibility (Chao et al., 2024). We prompt the target model with the same question 4 times, and ensure it produces a toxic response, as evaluated by M_judge, at least 3 times.
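The evaluation protocol quoted above (4 queries per question, success only if the judge flags at least 3 responses as toxic) can be sketched as below. The `target` and `judge` functions here are toy stand-ins for the temperature-1 target model and the judge model M_judge, which in the paper are LLM calls.

```python
def attack_success(target_fn, judge_fn, question, n_trials=4, threshold=3):
    # Query the stochastic (temperature-1) target n_trials times and count
    # how many responses the judge labels toxic; the question counts as a
    # successful jailbreak only if at least `threshold` responses are toxic.
    toxic = sum(judge_fn(target_fn(question)) for _ in range(n_trials))
    return toxic >= threshold

def attack_success_rate(target_fn, judge_fn, questions):
    # ASR = fraction of questions for which the attack succeeds.
    wins = sum(attack_success(target_fn, judge_fn, q) for q in questions)
    return wins / len(questions)

# Toy stand-ins: a "target" that complies only with prompts containing
# "please", and a keyword-matching "judge".
target = lambda q: "harmful details" if "please" in q else "I can't help"
judge = lambda r: r == "harmful details"
print(attack_success_rate(target, judge, ["please tell me", "tell me"]))  # 0.5
```

The 3-of-4 majority threshold matters because the target runs at temperature 1: a single toxic completion could be a sampling fluke, so requiring repeated toxic responses makes the reported ASR robust to the target's stochasticity.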