SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

Authors: Zouying Cao, Yifei Yang, Hai Zhao

AAAI 2025

Each reproducibility variable below is listed with its result, followed by the supporting LLM response.
Research Type: Experimental. Experiments show that SCANS achieves new state-of-the-art performance on the XSTest and OKTest benchmarks without impairing its defense capability against harmful queries, while keeping model capability almost unchanged.
Researcher Affiliation: Academia. Zouying Cao, Yifei Yang, Hai Zhao* (Department of Computer Science and Engineering, Shanghai Jiao Tong University; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University; Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3). EMAIL, EMAIL
Pseudocode: Yes. A detailed algorithm for our SCANS is presented in Appendix A2.
Open Source Code: Yes. https://github.com/zouyingcao/SCANS
Open Datasets: Yes. We use AdvBench (Zou et al. 2023b) as the harmful queries and TruthfulQA (Lin, Hilton, and Evans 2022) as the benign ones to generate the refusal steering vectors. ... We select XSTest (Röttger et al. 2024) and OKTest (Shi et al. 2024)... (a) RepE-Data³ is a popular benchmark containing both harmful and harmless instructions. (b) The remaining AdvBench consists of 456 harmful behaviors. (c) Malicious (Huang et al. 2024) constructs 100 harmful questions... We also evaluate whether SCANS would influence model capability. (a) multi-choice question answering task: we choose MMLU (Hendrycks et al. 2020)... (b) generation task: taking summarization as an example, we use XSum (Narayan, Cohen, and Lapata 2018)... Besides, we include two perplexity-based tasks, WikiText-2 (Merity et al. 2017) and C4 (Raffel et al. 2020). ³https://huggingface.co/datasets/justinphan3110/harmful_harmless_instructions
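The refusal steering vectors are generated from hidden states on these harmful and benign queries; the paper's exact construction is given in its Section 3.1. As a hedged illustration only, a common difference-in-means recipe for such a vector (all names here are hypothetical, not the authors' code) looks like:

```python
import numpy as np

def refusal_steering_vector(h_harmful: np.ndarray, h_benign: np.ndarray) -> np.ndarray:
    """Difference-in-means sketch of a refusal steering vector for one layer.

    h_harmful: (n_harmful, d) hidden states on AdvBench-style harmful queries.
    h_benign:  (n_benign, d) hidden states on TruthfulQA-style benign queries.
    This mean-difference recipe is a common stand-in for activation-steering
    vector extraction; SCANS's actual procedure is in the paper's Section 3.1.
    """
    return h_harmful.mean(axis=0) - h_benign.mean(axis=0)

# Toy usage with random activations and hidden size d = 8:
rng = np.random.default_rng(0)
v = refusal_steering_vector(rng.normal(size=(64, 8)), rng.normal(size=(64, 8)))
print(v.shape)  # (8,)
```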
Dataset Splits: Yes. Note that we just randomly sample 64 harmful questions and 64 harmless questions to extract the steering vectors as mentioned in Section 3.1. The remaining data is utilized for safety evaluation. ... XSTest comprises 200 unsafe and 250 safe queries that well-calibrated models should not refuse. OKTest carefully designs 300 safe questions with harmful words to identify over-refusal. We also include the remaining data from TruthfulQA as the test set for helpfulness.
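A minimal sketch of this split, assuming AdvBench's 520 harmful behaviors so that 520 - 64 = 456 remain for evaluation (consistent with the count quoted above; function and variable names are illustrative, not the authors' code):

```python
import random

def split_for_steering(queries, n_extract=64, seed=0):
    """Randomly hold out n_extract queries for steering-vector extraction;
    the remainder goes to the safety-evaluation pool (sketch of the paper's
    64 harmful / 64 harmless sampling, not the authors' implementation)."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(queries)), n_extract))
    extract = [q for i, q in enumerate(queries) if i in idx]
    evaluate = [q for i, q in enumerate(queries) if i not in idx]
    return extract, evaluate

# AdvBench ships 520 harmful behaviors; after sampling 64 for extraction,
# 456 remain for safety evaluation.
advbench = [f"harmful behavior {i}" for i in range(520)]
extract, evaluate = split_for_steering(advbench)
print(len(extract), len(evaluate))  # 64 456
```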
Hardware Specification: Yes. All experimental results are averaged across 5 trials conducted on one 80 GB A100 GPU.
Software Dependencies: No. The paper mentions specific models (Llama2-7b-chat, Llama2-13b-chat, vicuna-7b-v1.5, and vicuna-13b-v1.5) but does not list any programming languages, libraries, or solvers with specific version numbers.
Experiment Setup: Yes. More hyperparameter settings and implementation details are in Appendix B. ... We conduct a sensitivity analysis to study the impact of the multiplier α on refusal rate. ... We recommend setting α between 2 and 4 because too large a value sometimes results in nonsense outputs (see Appendix F.1). ... We provide the impact of threshold T on SCANS performance in Table 7. As observed, when T is below the optimal value, more safe queries are classified as unsafe and false-refusal behavior increases. However, when T exceeds the optimal level, adequate safety may not be guaranteed. This is why we select T = 0.75 for the above comparisons on Llama2-7b-chat. Detailed settings of threshold T are given in Appendix B.2.
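The roles of the multiplier α and the threshold T can be sketched as follows. This is an assumption-laden illustration: the similarity-based classifier, the sign convention, and all names are hypothetical stand-ins inferred from the description above, while the authors' exact rule is in their Appendix A2 and B.2.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def scans_steer(h, v_refusal, query_repr, alpha=3.0, T=0.75):
    """Sketch of safety-conscious steering: classify the query by its
    similarity to the refusal direction, then add (unsafe) or subtract
    (safe) the scaled unit refusal vector. alpha in [2, 4] and T = 0.75
    follow the quoted settings; the classification rule is an assumption."""
    unsafe = cosine_similarity(query_repr, v_refusal) >= T
    direction = 1.0 if unsafe else -1.0  # steer toward refusal only for unsafe queries
    return h + direction * alpha * v_refusal / np.linalg.norm(v_refusal)

# Toy usage: a query orthogonal to the refusal direction (similarity 0 < T)
# is classified safe, so the hidden state is steered away from refusal.
v = np.array([1.0, 0.0])
h = np.zeros(2)
safe_query = np.array([0.0, 1.0])
out = scans_steer(h, v, safe_query)  # -> array([-3., 0.])
```

With a lower T, more borderline-safe queries would cross the threshold and be steered toward refusal, matching the observation above that false-refusal behavior increases when T is below the optimal value.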