Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron
Authors: Yiran Zhao, Wenxuan Zhang, Yuxi Xie, Anirudh Goyal, Kenji Kawaguchi, Michael Qizhe Shieh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that SN-Tune not only enhances the safety mechanism of instruction-tuned models but also establishes a safety mechanism for base models without compromising their general capabilities. Notably, it reduces the average harmful scores of Llama3-8B-Instruction from 65.5 to 2.0, Mistral-7B-Instruct-v0.2 from 70.8 to 4.5, and Vicuna-13B-1.5 from 93.5 to 3.0. |
| Researcher Affiliation | Collaboration | 1 National University of Singapore 2 Singapore University of Technology and Design 3 Google DeepMind |
| Pseudocode | No | The paper describes methods using equations and diagrams (e.g., Figure 1 for SN-Tune steps and equations 1-14 for neuron detection), but it does not contain an explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/zhaoyiran924/Safety-Neuron. |
| Open Datasets | Yes | The harmful score is evaluated using the harmful behavior dataset (Zou et al., 2023), by averaging the Attack Success Rate (ASR) across various adversarial attacking methods, including Direct Attack, GCG (Zou et al., 2023), AutoDAN (Liu et al., 2024), and PAIR (Chao et al., 2023). Concurrently, we assess the models' general capabilities using representative NLP tasks including MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and GSM8K (Cobbe et al., 2021). [...] The foundation neurons are detected by the Wikipedia corpus (https://huggingface.co/datasets/wikimedia/wikipedia) with the same neuron detection method illustrated in Section 2.1. |
| Dataset Splits | Yes | The harmful corpus set used to detect safety neurons is constructed from the training set split in Zou et al. (2024). More details are illustrated in Appendix A.2. [...] Specifically, we use a dataset of 50 documents where the model refuses to answer harmful questions, train for only 1 epoch, and set the initial learning rate to 1e-6. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'HarmBench implementation (Mazeika et al., 2024)' but does not provide specific version numbers for any software, libraries, or frameworks used for the experiments. |
| Experiment Setup | Yes | Specifically, we use a dataset of 50 documents where the model refuses to answer harmful questions, train for only 1 epoch, and set the initial learning rate to 1e-6. [...] The best performance is achieved with a single epoch, aligning with other continue-train approaches (Dou et al., 2024; Zhang et al., 2024). Additionally, higher learning rates lead to overfitting, resulting in both harmful score and general capabilities dropping to 0.0, while lower rates fail to effectively train safety into the model. Consequently, a learning rate of 1e-6 emerges as the optimal balance between low harmful score and high general capability. |