PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Authors: Avery Ma, Yangchen Pan, Amir-Massoud Farahmand
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking (MSJ). In this section, we present results showing PANDAS's effectiveness over baseline long-context jailbreaking methods, analyze the contribution of each PANDAS component, and evaluate performance against defended models. |
| Researcher Affiliation | Academia | 1University of Toronto, Vector Institute 2University of Oxford 3Polytechnique Montréal, Mila - Quebec AI Institute, University of Toronto. Correspondence to: Avery Ma <EMAIL>. |
| Pseudocode | No | The paper describes the methods using mathematical notation and natural language, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are present. |
| Open Source Code | Yes | Our source code is available at https://github.com/averyma/pandas. |
| Open Datasets | Yes | We also introduce ManyHarm, a dataset of harmful question-answer pairs... Dataset: We consider AdvBench (Zou et al., 2023) and HarmBench (Mazeika et al., 2024)... |
| Dataset Splits | No | The paper uses AdvBench and HarmBench as target prompt datasets for evaluation. It refers to using 'up to 256-shot prompts', which are in-context examples, not traditional dataset splits for training, validation, or testing of the PANDAS method itself. No specific percentages or counts for training/validation/test sets are provided. |
| Hardware Specification | No | The paper mentions 'GPU memory demands' and 'substantial GPU memory required to store attention scores' but does not specify any particular GPU models, CPU models, or other specific hardware components used for experiments. |
| Software Dependencies | No | The paper mentions using 'the Bayesian optimization toolbox provided by Nogueira (2014)' but does not provide a specific version number for this or any other software dependencies. |
| Experiment Setup | Yes | We follow Anil et al. (2024) and consider a maximum shot count of 256. Following their setup, we set the number of random search iterations to 128. For PA and ND, we explore the impact of the modified demonstrations' position (i.e., m in (2) and (4)) by evaluating four configurations: modifying the first demonstrations, the last demonstrations, all demonstrations, or a random subset of demonstrations. Additionally, the positive affirmation, refusal, and correction phrases are each uniformly randomly sampled from a list of 10 prompts per type... We use 5 steps of random exploration and set the total number of optimization steps to 50. |
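The quoted setup (5 random-exploration steps out of 50 total optimization steps, searching over a sampling distribution) can be sketched as a small optimization loop. This is only an illustrative sketch, not the paper's implementation: the `attack_success_rate` objective is a hypothetical stand-in for querying the target model, and simple random search with local perturbation replaces the Bayesian optimization toolbox (Nogueira, 2014) that the paper actually uses.

```python
import random

# Hypothetical stand-in for the true objective: in the paper, this would
# build a many-shot prompt from topics sampled with the given probabilities,
# query the target model, and measure attack success rate (ASR).
def attack_success_rate(probs):
    # Toy objective for the sketch: pretend some topics are more effective.
    return sum(p * w for p, w in zip(probs, [0.2, 0.5, 0.9, 0.4]))

def sample_probs(rng, n_topics):
    """Draw a random probability distribution over topics."""
    raw = [rng.random() for _ in range(n_topics)]
    total = sum(raw)
    return [x / total for x in raw]

def optimize_sampling(n_topics=4, init_points=5, total_steps=50, seed=0):
    """Mirror the reported budget: 5 random-exploration steps,
    50 optimization steps in total (random search, not true BO)."""
    rng = random.Random(seed)
    best_probs, best_asr = None, -1.0
    for step in range(total_steps):
        if step < init_points:
            # Pure exploration phase.
            probs = sample_probs(rng, n_topics)
        else:
            # Crude exploitation: perturb the best distribution so far,
            # then renormalize so the probabilities still sum to 1.
            probs = [max(p + rng.gauss(0, 0.05), 1e-6) for p in best_probs]
            total = sum(probs)
            probs = [p / total for p in probs]
        asr = attack_success_rate(probs)
        if asr > best_asr:
            best_probs, best_asr = probs, asr
    return best_probs, best_asr

best_probs, best_asr = optimize_sampling()
print(best_probs, best_asr)
```

With the real toolbox, the same budget would correspond to roughly `init_points=5` random probes followed by 45 model-guided steps; the sketch keeps only the budget and the explore/exploit split.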