On the Role of Attention Heads in Large Language Model Safety

Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments are performed on two models, i.e., Llama-2-7b-chat (Touvron et al., 2023) and Vicuna-7b-v1.5 (Zheng et al., 2024b), using three commonly used harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024). After ablating the safety attention head for the specific q H, we generate an output of 128 tokens for each query to evaluate the impact on model safety. We use greedy sampling to ensure result reproducibility and top-k sampling to capture changes in the probability distributions. We use the attack success rate (ASR) metric, which is widely used to evaluate model safety (Qi et al., 2024; Zeng et al., 2024):
Researcher Affiliation Collaboration Zhenhong Zhou1, Haiyang Yu1, Xinghua Zhang1, Rongwu Xu3, Fei Huang1, Kun Wang2, Yang Liu4, Junfeng Fang2 , Yongbin Li1 1Tongyi Lab, 2USTC, 3Tsinghua University, 4Nanyang Technological University EMAIL EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Safety Attention Head Att Ribution Algorithm (Sahara)
Open Source Code Yes Our code is available at https://github.com/ydyjya/Safety Head Attribution.
Open Datasets Yes Our experiments are performed on two models, i.e., Llama-2-7b-chat (Touvron et al., 2023) and Vicuna-7b-v1.5 (Zheng et al., 2024b), using three commonly used harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024).
Dataset Splits No The paper mentions using three harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024). However, it does not provide specific details on how these datasets were split into training, validation, or test sets.
Hardware Specification Yes GPU hours refer to the runtime for full generation on one A100 80GB GPU.
Software Dependencies No The paper mentions "Transformers library" but does not specify a version number or any other software dependencies with their versions.
Experiment Setup Yes After ablating the safety attention head for the specific q H, we generate an output of 128 tokens for each query to evaluate the impact on model safety. We use greedy sampling to ensure result reproducibility and top-k sampling to capture changes in the probability distributions. Our other generation settings are as follows: when determining that ablating a head reduces safety capability, we set max new tokens=128 and temperature=1. For generation, we set max new token=128 and k=5 for top-k sampling.