Adversarial Reasoning at Jailbreaking Time

Authors: Mahdi Sabbaghi, Paul Kassianik, George J. Pappas, Amin Karbasi, Hamed Hassani

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we show that our method achieves state-of-the-art success rates among semantic-space attacks and outperforms token-space methods for many target LLMs, particularly those that have been adversarially trained (see Table 1).
Researcher Affiliation | Collaboration | 1University of Pennsylvania 2Robust Intelligence @ Cisco. Correspondence to: Mahdi Sabbaghi <EMAIL>, Paul Kassianik <EMAIL>.
Pseudocode | Yes | Algorithm 1 Adversarial Reasoning
Require: Initial prompt S(0), jailbreaking goal I, desired answer y_I, target model T, loss function L_T, attacker A, feedback LLM F, refiner LLM R.
Parameters: number of children m, buffer size B, number of attacking prompts n, max iterations T.
1: Initialize buffer L ← {S(0)} with size B
2: for t = 1 to T do
3:   Select node S* ← arg max_{S ∈ L} V(S)
4:   Generate n attacking prompts P_i ← A(S*) and sort them by loss L_T(P_i, y_I)
5:   Generate feedbacks F = {F_1, ..., F_m} ← F([P_1, P_2, ..., P_n])
6:   Remove S* from L
7:   for each feedback F in F do
8:     Create child node Ŝ ← R(S*, F)
9:     Evaluate Ŝ via V(Ŝ)
10:    Insert Ŝ into L if the buffer is not full or Ŝ is better than the worst node in L
11: return best node from L
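The buffered best-first search in Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' implementation: `attacker`, `feedback_llm`, `refiner`, and `value_fn` are placeholder callables standing in for the paper's A, F, R, and V, and the attacker is assumed to return prompt objects carrying a precomputed `loss` attribute.

```python
def adversarial_reasoning(s0, attacker, feedback_llm, refiner, value_fn,
                          n=16, m=8, buffer_size=32, max_iters=15):
    """Best-first search over reasoning strings, mirroring Algorithm 1.

    All callables are hypothetical stand-ins for the paper's components:
    attacker (A), feedback_llm (F), refiner (R), value_fn (V).
    """
    # Buffer of (value, node) pairs, initialized with the root prompt.
    buffer = [(value_fn(s0), s0)]
    for _ in range(max_iters):
        # Steps 3 and 6: select and remove the highest-value node.
        best_idx = max(range(len(buffer)), key=lambda i: buffer[i][0])
        _, s_star = buffer.pop(best_idx)
        # Step 4: sample n attacking prompts and sort them by target loss.
        prompts = sorted(attacker(s_star, n), key=lambda p: p.loss)
        # Step 5: one feedback-LLM call turns the sorted prompts into m feedbacks.
        feedbacks = feedback_llm(prompts, m)
        # Steps 7-10: refine one child per feedback, then insert it into
        # the buffer if there is room or it beats the current worst node.
        for fb in feedbacks:
            child = refiner(s_star, fb)
            entry = (value_fn(child), child)
            if len(buffer) < buffer_size:
                buffer.append(entry)
            else:
                worst = min(range(len(buffer)), key=lambda i: buffer[i][0])
                if entry[0] > buffer[worst][0]:
                    buffer[worst] = entry
    # Step 11: return the best node remaining in the buffer.
    return max(buffer, key=lambda e: e[0])[1]
```

Keeping the buffer as a plain list is fine at B = 32; a heap would only matter for much larger buffers.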
Open Source Code | Yes | Code is available at GitHub.
Open Datasets | Yes | We test our algorithm on 50 uniformly sampled tasks selected from standard behaviors in the HarmBench dataset (Mazeika et al., 2024).
Dataset Splits | No | The paper mentions using "50 uniformly sampled tasks selected from standard behaviors in the Harmbench dataset", but it does not specify explicit training, validation, or test splits with percentages or counts for their own experimental setup, nor does it refer to a standard split from the dataset itself with specific attribution.
Hardware Specification | Yes | We used one NVIDIA A100 for our experiments.
Software Dependencies | No | The paper mentions using LLM models such as 'Vicuna-13b-v1.5' and 'Mixtral-8x7B-v0.1' and tools like the 'HarmBench judge', but it does not provide specific version numbers for software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | Unless otherwise specified, we set the temperature of the target model to 0. We execute our algorithm for T = 15 iterations per task. At each iteration, we query the current reasoning string in n = 16 separate streams to obtain the attacking prompts. For feedback generation, we use bucket size k = 2 and we generate m = 8 feedbacks. Each feedback yields one new reasoning-string candidate, so m = 8 candidates are added to the buffer per iteration. The buffer size for the list of candidate reasoning strings is B = 32.
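The reported hyperparameters can be collected into a single configuration for reference. This is a hypothetical sketch: the key names are illustrative, not taken from the authors' code, and only the values come from the paper.

```python
# Hypothetical configuration mirroring the reported hyperparameters;
# key names are illustrative, values are as stated in the paper.
CONFIG = {
    "target_temperature": 0.0,   # temperature of the target model
    "max_iterations": 15,        # T: algorithm iterations per task
    "num_attack_prompts": 16,    # n: parallel attack streams per iteration
    "feedback_bucket_size": 2,   # k: bucket size for feedback generation
    "num_feedbacks": 8,          # m: feedbacks (= new candidates) per iteration
    "buffer_size": 32,           # B: candidate reasoning strings retained
}

# Sanity check: one iteration's new candidates must fit in the buffer.
assert CONFIG["num_feedbacks"] <= CONFIG["buffer_size"]
```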