Safety Layers in Aligned Large Language Models: The Key to LLM Security
Authors: Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the proposed approach can significantly preserve LLM security while maintaining performance and reducing computational resources compared to full fine-tuning. |
| Researcher Affiliation | Academia | 1 University of Science and Technology of China 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (author email addresses redacted in the extracted text) |
| Pseudocode | Yes | With the cosine similarity analysis, parameter scaling, and the over-rejection dataset Do, our overall algorithm for precisely localizing the safety layers is as follows: Step 1. Perform the cosine similarity analysis in Section 3.3 for the aligned LLM and locate the initial safety layers, denoted as the range [i, j] from the appearance of the gap to the first smoothing. Step 2. Use the over-rejection dataset Do to complete the inference and count the number of queries that the LLM refuses to answer as Ro, to evaluate the tested LLM's baseline degree of over-rejection. Step 3. By selecting a scaling factor α > 1, we up-scale the parameters within layers i to j. |
| Open Source Code | No | The paper does not provide explicit open-source code for the methodology described. It references external resources like "alpaca finance. Huggingface. https://huggingface.co/datasets/gbharti/finance-alpaca, 2024." for datasets and "Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023." for evaluation tools, but no repository for the authors' own implementation. |
| Open Datasets | Yes | alpaca finance. Huggingface. https://huggingface.co/datasets/gbharti/finance-alpaca, 2024. Accessed: 2024-05-21. ... To evaluate the security performance of LLMs, we use Zou et al. (2023)'s malicious problem dataset Dm (520 data) |
| Dataset Splits | Yes | We constructed a normal dataset (DN), an implicit attack dataset (DI), and a backdoor dataset (DB), each consisting of thousands of data entries. All three datasets were derived from the generalized conversation dataset (alpaca finance, 2024). The ratio of backdoor data to normal data in DB is 1:1. ... In the harmful data attack scenario, we followed the harmful fine-tuning dataset setup proposed by Huang et al. (2024c), which consists of 1,000 normal data samples and 1,000 * p malicious data samples. ... we select 500 samples from the alpaca finance (2024) dataset, and ensure that they do not overlap with the fine-tuning data, as our test dataset DT. |
| Hardware Specification | No | The paper does not contain any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using GPT-4 for evaluation and various LLM models (Llama-3-8B-Instruct, Llama-2-7b-chat, gemma-2b-it, Phi-3-mini-4k-instruct), but it does not specify any programming languages, libraries, or other software with their version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | The hyperparameters used for fine-tuning different LLMs are provided in Appendix A.4.2. ... Table 6 shows the hyperparameter settings for each aligned LLM in the Normal, Implicit, and Backdoor attack scenarios. LLaMA-3-8B-Instruct ... learning rate 1e-4, training epochs 3, batch size 4, lr warmup steps 100 ... In the harmful data attack scenario, the initial learning rate for the tested aligned LLMs is consistently set to 1e-5. |
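The three-step localization algorithm quoted in the Pseudocode row (cosine-similarity gap, over-rejection count Ro, parameter up-scaling by α > 1) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the per-layer hidden-state inputs, the `gap_threshold` value, and the function names are all assumptions made for the sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two hidden-state vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def locate_safety_layers(normal_states, malicious_states, gap_threshold=0.1):
    """Step 1 (sketch): per layer, measure the cosine-similarity gap between
    hidden states for normal vs. malicious queries, and return the range
    [i, j] running from where the gap first appears to where it first
    smooths out again (or the last layer if it never does)."""
    gaps = [1.0 - cosine(n, m) for n, m in zip(normal_states, malicious_states)]
    i = next(l for l, g in enumerate(gaps) if g > gap_threshold)
    j = next((l for l in range(i, len(gaps)) if gaps[l] <= gap_threshold),
             len(gaps) - 1)
    return i, j

def upscale_layers(layer_params, i, j, alpha=1.2):
    """Step 3 (sketch): up-scale the parameters of layers i..j by α > 1."""
    return [alpha * p if i <= l <= j else p for l, p in enumerate(layer_params)]
```

Step 2 of the quoted algorithm, counting refusals Ro over the over-rejection dataset Do, is an inference loop with a refusal check and is omitted here since it depends on the serving stack.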
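The hyperparameters quoted in the Experiment Setup and Dataset Splits rows can be collected into a configuration fragment. This is a sketch assembled from the reported Appendix A.4.2 values: the dictionary keys and the `harmful_mix` helper are illustrative names, and `p` is the malicious-data ratio from the harmful fine-tuning setup of Huang et al. (2024c).

```python
# Fine-tuning hyperparameters as reported for LLaMA-3-8B-Instruct in the
# Normal/Implicit/Backdoor scenarios (key names are illustrative).
FINETUNE_CONFIG = {
    "learning_rate": 1e-4,
    "training_epochs": 3,
    "batch_size": 4,
    "lr_warmup_steps": 100,
}

# Harmful data attack scenario: initial learning rate fixed at 1e-5.
HARMFUL_ATTACK_LR = 1e-5

def harmful_mix(p):
    """Sample counts for the harmful-attack fine-tuning set:
    1,000 normal samples plus 1,000 * p malicious samples."""
    return 1000, int(1000 * p)
```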