Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Authors: Yiran Zhao, Wenxuan Zhang, Yuxi Xie, Anirudh Goyal, Kenji Kawaguchi, Michael Qizhe Shieh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that SN-Tune not only enhances the safety mechanism of instruction-tuned models but also establishes a safety mechanism for base models without compromising their general capabilities. Notably, it reduces the average harmful scores of Llama3-8B-Instruction from 65.5 to 2.0, Mistral-7B-Instruct-v0.2 from 70.8 to 4.5, and Vicuna-13B-1.5 from 93.5 to 3.0.
Researcher Affiliation | Collaboration | 1 National University of Singapore; 2 Singapore University of Technology and Design; 3 Google DeepMind
Pseudocode | No | The paper describes its methods using equations and diagrams (e.g., Figure 1 for the SN-Tune steps and Equations 1-14 for neuron detection), but it does not contain an explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code is publicly available at https://github.com/zhaoyiran924/Safety-Neuron.
Open Datasets | Yes | The harmful score is evaluated using the harmful behavior dataset (Zou et al., 2023), by averaging the Attack Success Rate (ASR) across various adversarial attacking methods, including Direct Attack, GCG (Zou et al., 2023), AutoDAN (Liu et al., 2024), and PAIR (Chao et al., 2023). Concurrently, we assess the models' general capabilities using representative NLP tasks including MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and GSM8K (Cobbe et al., 2021). [...] The foundation neurons are detected by the Wikipedia corpus (https://huggingface.co/datasets/wikimedia/wikipedia) with the same neuron detection method illustrated in Section 2.1.
Dataset Splits | Yes | The harmful corpus set used to detect safety neurons is constructed from the training-set split in Zou et al. (2024). More details are illustrated in Appendix A.2. [...] Specifically, we use a dataset of 50 documents where the model refuses to answer harmful questions, train for only 1 epoch, and set the initial learning rate to 1e-6.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper mentions using the 'HarmBench implementation (Mazeika et al., 2024)' but does not provide specific version numbers for any software, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | Specifically, we use a dataset of 50 documents where the model refuses to answer harmful questions, train for only 1 epoch, and set the initial learning rate to 1e-6. [...] The best performance is achieved with a single epoch, aligning with other continual-training approaches (Dou et al., 2024; Zhang et al., 2024). Additionally, higher learning rates lead to overfitting, with both the harmful score and general capabilities dropping to 0.0, while lower rates fail to effectively train safety into the model. Consequently, a learning rate of 1e-6 emerges as the optimal balance between a low harmful score and high general capability.
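As quoted in the Open Datasets row, the harmful score is simply the mean Attack Success Rate over the listed attack methods. A minimal sketch of that aggregation (the function name and the ASR numbers are illustrative, not taken from the paper's code or results):

```python
def harmful_score(asr_by_attack):
    """Harmful score = mean Attack Success Rate (ASR, %) across attacks.

    `asr_by_attack` maps an attack method name to its ASR on the
    harmful behavior dataset.
    """
    if not asr_by_attack:
        raise ValueError("need at least one attack method")
    return sum(asr_by_attack.values()) / len(asr_by_attack)


# Illustrative ASR values only, not numbers reported in the paper.
score = harmful_score({
    "DirectAttack": 10.0,
    "GCG": 40.0,
    "AutoDAN": 30.0,
    "PAIR": 20.0,
})
print(score)  # 25.0
```

Lower is better here, which is why the reported drops (e.g. 65.5 to 2.0 for Llama3-8B-Instruction) indicate a stronger safety mechanism.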
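SN-Tune's core idea, per the title and abstract, is to tune safety-specific neurons while leaving the rest of the model untouched. One plausible way to realize that is gradient masking: a toy sketch under that assumption (this is our illustration, not the authors' implementation; the mask, values, and learning rate are hypothetical, though the 1e-6 default mirrors the quoted setup):

```python
def masked_sgd_step(params, grads, safety_mask, lr=1e-6):
    """One SGD step that updates only masked (safety-neuron) parameters.

    params, grads: lists of floats for a toy model.
    safety_mask: list of 0/1 flags; 1 marks a parameter belonging to a
    detected safety neuron. Parameters with mask 0 stay frozen.
    """
    return [p - lr * g * m for p, g, m in zip(params, grads, safety_mask)]


params = [1.0, 2.0, 3.0]
grads = [10.0, 10.0, 10.0]
mask = [1, 0, 1]  # only the first and third parameters are "safety neurons"
print(masked_sgd_step(params, grads, mask, lr=0.1))  # [0.0, 2.0, 2.0]
```

The frozen second parameter is what keeps general capability intact while the safety-relevant parameters move, matching the paper's claim that safety improves "without compromising general capabilities".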