Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling

Authors: Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye

IJCAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The paper reports: "Our proposal is extensively evaluated on popular datasets, demonstrating superior performance compared to existing SOTA baselines." |
| Researcher Affiliation | Academia | ¹Beijing University of Posts and Telecommunications, ²Hangzhou Dianzi University, ³Beijing Academy of Artificial Intelligence, ⁴Fujian Cancer Hospital. (Author email addresses redacted.) |
| Pseudocode | No | The paper describes the proposed methods, Attention-Based Attack (ABA) and Attention-Based Defense (ABD), through descriptive text and a high-level diagram (Figure 2), but contains no formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper contains no explicit statement about making its source code publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Following previous work [Jiang et al., 2025], two main datasets are adopted: AdvBench Subset [Chao et al., 2024] and HarmBench [Mazeika et al., 2024]. The AdvBench Subset is used to evaluate the effectiveness of ABA and ABD, while HarmBench supplements the evaluation of ABA. |
| Dataset Splits | No | The paper mentions the AdvBench and HarmBench datasets, and states that AdvBench includes "520 malicious prompts", but it does not specify any training, validation, or test splits (e.g., percentages or absolute counts) for these datasets. |
| Hardware Specification | No | The paper lists target LLMs for evaluation ("Llama2-7B", "Llama2-13B", "Llama3-8B", "GPT-4", "Claude-3-haiku") but gives no details about the hardware used for the authors' experiments, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper does not specify ancillary software with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | No | The paper describes the general methodology for ABA and ABD and discusses the selection of the parameter β and the number of nested layers, but the main text gives no values for other common experimental hyperparameters such as learning rates, batch sizes, or optimizer settings. |