On the Role of Attention Heads in Large Language Model Safety
Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are performed on two models, i.e., Llama-2-7b-chat (Touvron et al., 2023) and Vicuna-7b-v1.5 (Zheng et al., 2024b), using three commonly used harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024). After ablating the safety attention head for the specific q H, we generate an output of 128 tokens for each query to evaluate the impact on model safety. We use greedy sampling to ensure result reproducibility and top-k sampling to capture changes in the probability distributions. We use the attack success rate (ASR) metric, which is widely used to evaluate model safety (Qi et al., 2024; Zeng et al., 2024): |
| Researcher Affiliation | Collaboration | Zhenhong Zhou1, Haiyang Yu1, Xinghua Zhang1, Rongwu Xu3, Fei Huang1, Kun Wang2, Yang Liu4, Junfeng Fang2 , Yongbin Li1 1Tongyi Lab, 2USTC, 3Tsinghua University, 4Nanyang Technological University EMAIL EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Safety Attention Head Att Ribution Algorithm (Sahara) |
| Open Source Code | Yes | Our code is available at https://github.com/ydyjya/Safety Head Attribution. |
| Open Datasets | Yes | Our experiments are performed on two models, i.e., Llama-2-7b-chat (Touvron et al., 2023) and Vicuna-7b-v1.5 (Zheng et al., 2024b), using three commonly used harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024). |
| Dataset Splits | No | The paper mentions using three harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024). However, it does not provide specific details on how these datasets were split into training, validation, or test sets. |
| Hardware Specification | Yes | GPU hours refer to the runtime for full generation on one A100 80GB GPU. |
| Software Dependencies | No | The paper mentions "Transformers library" but does not specify a version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | After ablating the safety attention head for the specific q H, we generate an output of 128 tokens for each query to evaluate the impact on model safety. We use greedy sampling to ensure result reproducibility and top-k sampling to capture changes in the probability distributions. Our other generation settings are as follows: when determining that ablating a head reduces safety capability, we set max new tokens=128 and temperature=1. For generation, we set max new token=128 and k=5 for top-k sampling. |