SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

Authors: Zouying Cao, Yifei Yang, Hai Zhao

AAAI 2025

Each reproducibility variable below is listed with its result, followed by the supporting LLM response.
Research Type: Experimental. Experiments show that SCANS achieves new state-of-the-art performance on the XSTest and OKTest benchmarks without impairing its defense capability against harmful queries, while keeping model capability almost unchanged.
Researcher Affiliation: Academia. Zouying Cao, Yifei Yang, Hai Zhao* (Department of Computer Science and Engineering, Shanghai Jiao Tong University; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University; Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3). EMAIL, EMAIL
Pseudocode: Yes. A detailed algorithm for our SCANS is presented in Appendix A2.
Open Source Code: Yes. https://github.com/zouyingcao/SCANS
Open Datasets: Yes. We use AdvBench (Zou et al. 2023b) as the harmful queries and TruthfulQA (Lin, Hilton, and Evans 2022) as the benign ones to generate the refusal steering vectors. ... We select XSTest (Röttger et al. 2024) and OKTest (Shi et al. 2024)... (a) RepE-Data³ is a popular benchmark containing both harmful and harmless instructions. (b) The remaining AdvBench consists of 456 harmful behaviors. (c) Malicious (Huang et al. 2024) constructs 100 harmful questions... We also evaluate whether SCANS would influence model capability. (a) multi-choice question answering task: we choose MMLU (Hendrycks et al. 2020)... (b) generation task: taking summarization as an example, we use XSum (Narayan, Cohen, and Lapata 2018)... Besides, we include two perplexity-based tasks, WikiText-2 (Merity et al. 2017) and C4 (Raffel et al. 2020). ³https://huggingface.co/datasets/justinphan3110/harmful_harmless_instructions
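The refusal steering vectors are generated from hidden states on these harmful and benign queries; the paper's exact construction is given in its Section 3.1. As a hedged illustration only, a common difference-in-means recipe for such a vector (all names here are hypothetical, not the authors' code) looks like:

```python
import numpy as np

def refusal_steering_vector(h_harmful: np.ndarray, h_benign: np.ndarray) -> np.ndarray:
    """Difference-in-means sketch of a refusal steering vector for one layer.

    h_harmful: (n_harmful, d) hidden states on AdvBench-style harmful queries.
    h_benign:  (n_benign, d) hidden states on TruthfulQA-style benign queries.
    This mean-difference recipe is a common stand-in for activation-steering
    vector extraction; SCANS's actual procedure is in the paper's Section 3.1.
    """
    return h_harmful.mean(axis=0) - h_benign.mean(axis=0)

# Toy usage with random activations and hidden size d = 8:
rng = np.random.default_rng(0)
v = refusal_steering_vector(rng.normal(size=(64, 8)), rng.normal(size=(64, 8)))
print(v.shape)  # (8,)
```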
Dataset Splits: Yes. Note that we just randomly sample 64 harmful questions and 64 harmless questions to extract the steering vectors as mentioned in Section 3.1. The remaining data is utilized for safety evaluation. ... XSTest comprises 200 unsafe and 250 safe queries that well-calibrated models should not refuse. OKTest carefully designs 300 safe questions with harmful words to identify over-refusal. We also include the remaining data from TruthfulQA as the test set for helpfulness.
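A minimal sketch of this split, assuming AdvBench's 520 harmful behaviors so that 520 - 64 = 456 remain for evaluation (consistent with the count quoted above; function and variable names are illustrative, not the authors' code):

```python
import random

def split_for_steering(queries, n_extract=64, seed=0):
    """Randomly hold out n_extract queries for steering-vector extraction;
    the remainder goes to the safety-evaluation pool (sketch of the paper's
    64 harmful / 64 harmless sampling, not the authors' implementation)."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(queries)), n_extract))
    extract = [q for i, q in enumerate(queries) if i in idx]
    evaluate = [q for i, q in enumerate(queries) if i not in idx]
    return extract, evaluate

# AdvBench ships 520 harmful behaviors; after sampling 64 for extraction,
# 456 remain for safety evaluation.
advbench = [f"harmful behavior {i}" for i in range(520)]
extract, evaluate = split_for_steering(advbench)
print(len(extract), len(evaluate))  # 64 456
```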
Hardware Specification: Yes. All experimental results are averaged across 5 trials conducted on one 80 GB A100 GPU.
Software Dependencies: No. The paper mentions specific models (Llama2-7b-chat, Llama2-13b-chat, vicuna-7b-v1.5, and vicuna-13b-v1.5) but does not list any programming languages, libraries, or solvers with specific version numbers.
Experiment Setup: Yes. More hyperparameter settings and implementation details are in Appendix B. ... We conduct a sensitivity analysis to study the impact of the multiplier α on refusal rate. ... We recommend setting α between 2 and 4 because too large a value sometimes results in nonsense outputs (see Appendix F.1). ... We provide the impact of threshold T on SCANS performance in Table 7. As observed, when T is below the optimal value, more safe queries are classified as unsafe and false-refusal behavior increases. However, when T exceeds the optimal level, adequate safety may not be guaranteed. This is why we select T = 0.75 for the above comparisons on Llama2-7b-chat. Detailed settings of threshold T are given in Appendix B.2.
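The roles of the multiplier α and the threshold T can be sketched as follows. This is an assumption-laden illustration: the similarity-based classifier, the sign convention, and all names are hypothetical stand-ins inferred from the description above, while the authors' exact rule is in their Appendix A2 and B.2.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def scans_steer(h, v_refusal, query_repr, alpha=3.0, T=0.75):
    """Sketch of safety-conscious steering: classify the query by its
    similarity to the refusal direction, then add (unsafe) or subtract
    (safe) the scaled unit refusal vector. alpha in [2, 4] and T = 0.75
    follow the quoted settings; the classification rule is an assumption."""
    unsafe = cosine_similarity(query_repr, v_refusal) >= T
    direction = 1.0 if unsafe else -1.0  # steer toward refusal only for unsafe queries
    return h + direction * alpha * v_refusal / np.linalg.norm(v_refusal)

# Toy usage: a query orthogonal to the refusal direction (similarity 0 < T)
# is classified safe, so the hidden state is steered away from refusal.
v = np.array([1.0, 0.0])
h = np.zeros(2)
safe_query = np.array([0.0, 1.0])
out = scans_steer(h, v, safe_query)  # -> array([-3., 0.])
```

With a lower T, more borderline-safe queries would cross the threshold and be steered toward refusal, matching the observation above that false-refusal behavior increases when T is below the optimal value.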