Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling
Authors: Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the superiority of our proposal when compared to SOTA baselines. Our proposal is extensively evaluated on popular datasets. |
| Researcher Affiliation | Academia | ¹Beijing University of Posts and Telecommunications, ²Hangzhou Dianzi University, ³Beijing Academy of Artificial Intelligence, ⁴Fujian Cancer Hospital |
| Pseudocode | No | The paper describes the proposed methods, Attention-Based Attack (ABA) and Attention-Based Defense (ABD), in detail through descriptive text and a high-level diagram (Figure 2), but does not contain any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about making its source code publicly available, nor does it provide any links to a code repository. |
| Open Datasets | Yes | Following previous work [Jiang et al., 2025], two main datasets are adopted: AdvBench Subset [Chao et al., 2024] and HarmBench [Mazeika et al., 2024]. AdvBench Subset is used to evaluate the effectiveness of ABA and ABD, while HarmBench supplements the evaluation of ABA. |
| Dataset Splits | No | The paper mentions using the AdvBench and HarmBench datasets, and states that AdvBench includes '520 malicious prompts', but it does not specify any training, validation, or test splits (e.g., percentages or absolute counts) for these datasets. |
| Hardware Specification | No | The paper evaluates target LLMs including "Llama2-7B", "Llama2-13B", "Llama3-8B", "GPT-4", and "Claude-3-haiku". However, it does not provide specific details about the hardware used to conduct the authors' experiments, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | No | The paper describes the general methodology for ABA and ABD and discusses the selection of parameter β and the number of nested layers, but it does not provide specific values for other common experimental hyperparameters such as learning rates, batch sizes, or optimizer settings in the main text. |