Adversarial Reasoning at Jailbreaking Time

Authors: Mahdi Sabbaghi, Paul Kassianik, George J. Pappas, Amin Karbasi, Hamed Hassani

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we show that our method achieves state-of-the-art success rates among semantic-space attacks and outperforms token-space methods for many target LLMs, particularly those that have been adversarially trained (see Table 1).
Researcher Affiliation | Collaboration | 1University of Pennsylvania 2Robust Intelligence @ Cisco. Correspondence to: Mahdi Sabbaghi <EMAIL>, Paul Kassianik <EMAIL>.
Pseudocode | Yes | Algorithm 1 Adversarial Reasoning
Require: Initial prompt S(0), jailbreaking goal I, desired answer y_I, target model T, loss function L_T, attacker A, feedback LLM F, refiner LLM R.
Parameters: number of children m, buffer size B, number of attacking prompts n, max iterations T.
1: Initialize buffer L ← {S(0)} with size B
2: for t = 1 to T do
3:   Select node S* ← arg max_{S ∈ L} V(S)
4:   Generate n attacking prompts P_i ← A(S*) and sort them by loss L_T(P_i, y_I)
5:   Generate feedbacks F = {F_1, ..., F_m} ← F([P_1, P_2, ..., P_n])
6:   Remove S* from L
7:   for each feedback F in F do
8:     Create child node Ŝ ← R(S*, F)
9:     Evaluate Ŝ via V(Ŝ)
10:    Insert Ŝ into L if the buffer is not full or Ŝ is better than the worst node in L
11: return best node from L
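The buffered best-first search in Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' implementation: `attacker`, `feedback_llm`, `refiner`, and `value_fn` are placeholder callables standing in for the paper's A, F, R, and V, and the attacker is assumed to return prompt objects carrying a precomputed `loss` attribute.

```python
def adversarial_reasoning(s0, attacker, feedback_llm, refiner, value_fn,
                          n=16, m=8, buffer_size=32, max_iters=15):
    """Best-first search over reasoning strings, mirroring Algorithm 1.

    All callables are hypothetical stand-ins for the paper's components:
    attacker (A), feedback_llm (F), refiner (R), value_fn (V).
    """
    # Buffer of (value, node) pairs, initialized with the root prompt.
    buffer = [(value_fn(s0), s0)]
    for _ in range(max_iters):
        # Steps 3 and 6: select and remove the highest-value node.
        best_idx = max(range(len(buffer)), key=lambda i: buffer[i][0])
        _, s_star = buffer.pop(best_idx)
        # Step 4: sample n attacking prompts and sort them by target loss.
        prompts = sorted(attacker(s_star, n), key=lambda p: p.loss)
        # Step 5: one feedback-LLM call turns the sorted prompts into m feedbacks.
        feedbacks = feedback_llm(prompts, m)
        # Steps 7-10: refine one child per feedback, then insert it into
        # the buffer if there is room or it beats the current worst node.
        for fb in feedbacks:
            child = refiner(s_star, fb)
            entry = (value_fn(child), child)
            if len(buffer) < buffer_size:
                buffer.append(entry)
            else:
                worst = min(range(len(buffer)), key=lambda i: buffer[i][0])
                if entry[0] > buffer[worst][0]:
                    buffer[worst] = entry
    # Step 11: return the best node remaining in the buffer.
    return max(buffer, key=lambda e: e[0])[1]
```

Keeping the buffer as a plain list is fine at B = 32; a heap would only matter for much larger buffers.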
Open Source Code | Yes | Code is available at GitHub.
Open Datasets | Yes | We test our algorithm on 50 uniformly sampled tasks selected from standard behaviors in the HarmBench dataset (Mazeika et al., 2024).
Dataset Splits | No | The paper mentions using "50 uniformly sampled tasks selected from standard behaviors in the Harmbench dataset", but it does not specify explicit training, validation, or test splits with percentages or counts for their own experimental setup, nor does it refer to a standard split from the dataset itself with specific attribution.
Hardware Specification | Yes | We used one NVIDIA A100 for our experiments.
Software Dependencies | No | The paper mentions using LLM models such as 'Vicuna-13b-v1.5' and 'Mixtral-8x7B-v0.1' and tools like the 'HarmBench judge', but it does not provide specific version numbers for software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | Unless otherwise specified, we set the temperature of the target model to 0. We execute our algorithm for T = 15 iterations per task. At each iteration, we query the current reasoning string in n = 16 separate streams to obtain the attacking prompts. For feedback generation, we use bucket size k = 2 and we generate m = 8 feedbacks. Each feedback yields one new reasoning-string candidate, so m = 8 candidates are added to the buffer per iteration. The buffer size for the list of candidate reasoning strings is B = 32.
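The reported hyperparameters can be collected into a single configuration for reference. This is a hypothetical sketch: the key names are illustrative, not taken from the authors' code, and only the values come from the paper.

```python
# Hypothetical configuration mirroring the reported hyperparameters;
# key names are illustrative, values are as stated in the paper.
CONFIG = {
    "target_temperature": 0.0,   # temperature of the target model
    "max_iterations": 15,        # T: algorithm iterations per task
    "num_attack_prompts": 16,    # n: parallel attack streams per iteration
    "feedback_bucket_size": 2,   # k: bucket size for feedback generation
    "num_feedbacks": 8,          # m: feedbacks (= new candidates) per iteration
    "buffer_size": 32,           # B: candidate reasoning strings retained
}

# Sanity check: one iteration's new candidates must fit in the buffer.
assert CONFIG["num_feedbacks"] <= CONFIG["buffer_size"]
```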