Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Authors: Yiran Zhao, Wenxuan Zhang, Yuxi Xie, Anirudh Goyal, Kenji Kawaguchi, Michael Qizhe Shieh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that SN-Tune not only enhances the safety mechanism of instruction-tuned models but also establishes a safety mechanism for base models without compromising their general capabilities. Notably, it reduces the average harmful scores of Llama3-8B-Instruction from 65.5 to 2.0, Mistral-7B-Instruct-v0.2 from 70.8 to 4.5, and Vicuna-13B-1.5 from 93.5 to 3.0.
Researcher Affiliation | Collaboration | 1 National University of Singapore; 2 Singapore University of Technology and Design; 3 Google DeepMind
Pseudocode | No | The paper describes its methods using equations and diagrams (e.g., Figure 1 for the SN-Tune steps and Equations 1-14 for neuron detection), but it does not contain an explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code is publicly available at https://github.com/zhaoyiran924/Safety-Neuron.
Open Datasets | Yes | The harmful score is evaluated using the harmful behavior dataset (Zou et al., 2023), by averaging the Attack Success Rate (ASR) across various adversarial attacking methods, including Direct Attack, GCG (Zou et al., 2023), AutoDAN (Liu et al., 2024), and PAIR (Chao et al., 2023). Concurrently, we assess the models' general capabilities using representative NLP tasks including MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and GSM8K (Cobbe et al., 2021). [...] The foundation neurons are detected by the Wikipedia corpus (https://huggingface.co/datasets/wikimedia/wikipedia) with the same neuron detection method illustrated in Section 2.1.
Dataset Splits | Yes | The harmful corpus set used to detect safety neurons is constructed from the training-set split in Zou et al. (2024). More details are illustrated in Appendix A.2. [...] Specifically, we use a dataset of 50 documents where the model refuses to answer harmful questions, train for only 1 epoch, and set the initial learning rate to 1e-6.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper mentions using the 'HarmBench implementation (Mazeika et al., 2024)' but does not provide specific version numbers for any software, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | Specifically, we use a dataset of 50 documents where the model refuses to answer harmful questions, train for only 1 epoch, and set the initial learning rate to 1e-6. [...] The best performance is achieved with a single epoch, aligning with other continual-training approaches (Dou et al., 2024; Zhang et al., 2024). Additionally, higher learning rates lead to overfitting, with both the harmful score and general capabilities dropping to 0.0, while lower rates fail to effectively train safety into the model. Consequently, a learning rate of 1e-6 emerges as the optimal balance between a low harmful score and high general capability.
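As quoted in the Open Datasets row, the harmful score is simply the mean Attack Success Rate over the listed attack methods. A minimal sketch of that aggregation (the function name and the ASR numbers are illustrative, not taken from the paper's code or results):

```python
def harmful_score(asr_by_attack):
    """Harmful score = mean Attack Success Rate (ASR, %) across attacks.

    `asr_by_attack` maps an attack method name to its ASR on the
    harmful behavior dataset.
    """
    if not asr_by_attack:
        raise ValueError("need at least one attack method")
    return sum(asr_by_attack.values()) / len(asr_by_attack)


# Illustrative ASR values only, not numbers reported in the paper.
score = harmful_score({
    "DirectAttack": 10.0,
    "GCG": 40.0,
    "AutoDAN": 30.0,
    "PAIR": 20.0,
})
print(score)  # 25.0
```

Lower is better here, which is why the reported drops (e.g. 65.5 to 2.0 for Llama3-8B-Instruction) indicate a stronger safety mechanism.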
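SN-Tune's core idea, per the title and abstract, is to tune safety-specific neurons while leaving the rest of the model untouched. One plausible way to realize that is gradient masking: a toy sketch under that assumption (this is our illustration, not the authors' implementation; the mask, values, and learning rate are hypothetical, though the 1e-6 default mirrors the quoted setup):

```python
def masked_sgd_step(params, grads, safety_mask, lr=1e-6):
    """One SGD step that updates only masked (safety-neuron) parameters.

    params, grads: lists of floats for a toy model.
    safety_mask: list of 0/1 flags; 1 marks a parameter belonging to a
    detected safety neuron. Parameters with mask 0 stay frozen.
    """
    return [p - lr * g * m for p, g, m in zip(params, grads, safety_mask)]


params = [1.0, 2.0, 3.0]
grads = [10.0, 10.0, 10.0]
mask = [1, 0, 1]  # only the first and third parameters are "safety neurons"
print(masked_sgd_step(params, grads, mask, lr=0.1))  # [0.0, 2.0, 2.0]
```

The frozen second parameter is what keeps general capability intact while the safety-relevant parameters move, matching the paper's claim that safety improves "without compromising general capabilities".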