reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Role of Attention Heads in Large Language Model Safety

Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments are performed on two models, i.e., Llama-2-7b-chat (Touvron et al., 2023) and Vicuna-7b-v1.5 (Zheng et al., 2024b), using three commonly used harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024). After ablating the safety attention head for the specific q H, we generate an output of 128 tokens for each query to evaluate the impact on model safety. We use greedy sampling to ensure result reproducibility and top-k sampling to capture changes in the probability distributions. We use the attack success rate (ASR) metric, which is widely used to evaluate model safety (Qi et al., 2024; Zeng et al., 2024):
Researcher Affiliation	Collaboration	Zhenhong Zhou1, Haiyang Yu1, Xinghua Zhang1, Rongwu Xu3, Fei Huang1, Kun Wang2, Yang Liu4, Junfeng Fang2 , Yongbin Li1 1Tongyi Lab, 2USTC, 3Tsinghua University, 4Nanyang Technological University EMAIL EMAIL, EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Safety Attention Head Att Ribution Algorithm (Sahara)
Open Source Code	Yes	Our code is available at https://github.com/ydyjya/Safety Head Attribution.
Open Datasets	Yes	Our experiments are performed on two models, i.e., Llama-2-7b-chat (Touvron et al., 2023) and Vicuna-7b-v1.5 (Zheng et al., 2024b), using three commonly used harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024).
Dataset Splits	No	The paper mentions using three harmful query datasets: Advbench (Zou et al., 2023b), Jailbreakbench (Chao et al., 2024), and Malicious Instruct (Huang et al., 2024). However, it does not provide specific details on how these datasets were split into training, validation, or test sets.
Hardware Specification	Yes	GPU hours refer to the runtime for full generation on one A100 80GB GPU.
Software Dependencies	No	The paper mentions "Transformers library" but does not specify a version number or any other software dependencies with their versions.
Experiment Setup	Yes	After ablating the safety attention head for the specific q H, we generate an output of 128 tokens for each query to evaluate the impact on model safety. We use greedy sampling to ensure result reproducibility and top-k sampling to capture changes in the probability distributions. Our other generation settings are as follows: when determining that ablating a head reduces safety capability, we set max new tokens=128 and temperature=1. For generation, we set max new token=128 and k=5 for top-k sampling.