PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Authors: Avery Ma, Yangchen Pan, Amir-Massoud Farahmand
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking (MSJ). In this section, we present results showing PANDAS's effectiveness over baseline long-context jailbreaking methods, analyze the contribution of each PANDAS component, and evaluate performance against defended models. |
| Researcher Affiliation | Academia | 1University of Toronto, Vector Institute 2University of Oxford 3Polytechnique Montréal, Mila - Quebec AI Institute, University of Toronto. Correspondence to: Avery Ma <EMAIL>. |
| Pseudocode | No | The paper describes the methods using mathematical notation and natural language, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are present. |
| Open Source Code | Yes | Our source code is available at https://github.com/averyma/pandas. |
| Open Datasets | Yes | We also introduce ManyHarm, a dataset of harmful question-answer pairs... Dataset: We consider AdvBench (Zou et al., 2023) and HarmBench (Mazeika et al., 2024)... |
| Dataset Splits | No | The paper uses AdvBench and HarmBench as target prompt datasets for evaluation. It refers to using 'up to 256-shot prompts', which are in-context examples, not traditional dataset splits for training, validation, or testing of the PANDAS method itself. No specific percentages or counts for training/validation/test sets are provided. |
| Hardware Specification | No | The paper mentions 'GPU memory demands' and 'substantial GPU memory required to store attention scores' but does not specify any particular GPU models, CPU models, or other specific hardware components used for experiments. |
| Software Dependencies | No | The paper mentions using 'the Bayesian optimization toolbox provided by Nogueira (2014)' but does not provide a specific version number for this or any other software dependencies. |
| Experiment Setup | Yes | We follow Anil et al. (2024) and consider a maximum shot count of 256. Following their setup, we set the number of random search iterations to 128. For PA and ND, we explore the impact of the modified demonstrations' position (i.e., m in (2) and (4)) by evaluating four configurations: modifying the first demonstrations, the last demonstrations, all demonstrations, or a random subset of demonstrations. Additionally, the positive affirmation, refusal, and correction phrases are each uniformly randomly sampled from a list of 10 prompts per type... We use 5 steps of random exploration and set the total number of optimization steps to 50. |
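The quoted setup (5 random-exploration steps out of 50 total optimization steps, searching over a sampling distribution) can be sketched as a small optimization loop. This is only an illustrative sketch, not the paper's implementation: the `attack_success_rate` objective is a hypothetical stand-in for querying the target model, and simple random search with local perturbation replaces the Bayesian optimization toolbox (Nogueira, 2014) that the paper actually uses.

```python
import random

# Hypothetical stand-in for the true objective: in the paper, this would
# build a many-shot prompt from topics sampled with the given probabilities,
# query the target model, and measure attack success rate (ASR).
def attack_success_rate(probs):
    # Toy objective for the sketch: pretend some topics are more effective.
    return sum(p * w for p, w in zip(probs, [0.2, 0.5, 0.9, 0.4]))

def sample_probs(rng, n_topics):
    """Draw a random probability distribution over topics."""
    raw = [rng.random() for _ in range(n_topics)]
    total = sum(raw)
    return [x / total for x in raw]

def optimize_sampling(n_topics=4, init_points=5, total_steps=50, seed=0):
    """Mirror the reported budget: 5 random-exploration steps,
    50 optimization steps in total (random search, not true BO)."""
    rng = random.Random(seed)
    best_probs, best_asr = None, -1.0
    for step in range(total_steps):
        if step < init_points:
            # Pure exploration phase.
            probs = sample_probs(rng, n_topics)
        else:
            # Crude exploitation: perturb the best distribution so far,
            # then renormalize so the probabilities still sum to 1.
            probs = [max(p + rng.gauss(0, 0.05), 1e-6) for p in best_probs]
            total = sum(probs)
            probs = [p / total for p in probs]
        asr = attack_success_rate(probs)
        if asr > best_asr:
            best_probs, best_asr = probs, asr
    return best_probs, best_asr

best_probs, best_asr = optimize_sampling()
print(best_probs, best_asr)
```

With the real toolbox, the same budget would correspond to roughly `init_points=5` random probes followed by 45 model-guided steps; the sketch keeps only the budget and the explore/exploit split.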