Safety Layers in Aligned Large Language Models: The Key to LLM Security
Authors: Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the proposed approach can significantly preserve LLM security while maintaining performance and reducing computational resources compared to full fine-tuning. |
| Researcher Affiliation | Academia | 1 University of Science and Technology of China 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (author email addresses redacted in the extracted text) |
| Pseudocode | Yes | With the cosine similarity analysis, parameter scaling, and the over-rejection dataset Do, our overall algorithm for precisely localizing the safety layers is as follows: Step 1. Perform the cosine similarity analysis in Section 3.3 for the aligned LLM and locate the initial safety layers, denoted as the range [i, j] from the appearance of the gap to the first smoothing. Step 2. Use the over-rejection dataset Do to complete the inference and count the number of queries that the LLM refuses to answer as Ro, to evaluate the tested LLM's baseline degree of over-rejection. Step 3. By selecting a scaling factor α > 1, we up-scale the parameters within layers i to j. |
| Open Source Code | No | The paper does not provide explicit open-source code for the methodology described. It references external resources like "alpaca finance. Huggingface. https://huggingface.co/datasets/gbharti/finance-alpaca, 2024." for datasets and "Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023." for evaluation tools, but no repository for the authors' own implementation. |
| Open Datasets | Yes | alpaca finance. Huggingface. https://huggingface.co/datasets/gbharti/finance-alpaca, 2024. Accessed: 2024-05-21. ... To evaluate the security performance of LLMs, we use Zou et al. (2023)'s malicious problem dataset Dm (520 data) |
| Dataset Splits | Yes | We constructed a normal dataset (DN), an implicit attack dataset (DI), and a backdoor dataset (DB), each consisting of thousands of data entries. All three datasets were derived from the generalized conversation dataset (alpaca finance, 2024). The ratio of backdoor data to normal data in DB is 1:1. ... In the harmful data attack scenario, we followed the harmful fine-tuning dataset setup proposed by Huang et al. (2024c), which consists of 1,000 normal data samples and 1,000 * p malicious data samples. ... we select 500 samples from the alpaca finance (2024) dataset, and ensure that they do not overlap with the fine-tuning data, as our test dataset DT. |
| Hardware Specification | No | The paper does not contain any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using GPT-4 for evaluation and various LLM models (Llama-3-8B-Instruct, Llama-2-7b-chat, gemma-2b-it, Phi-3-mini-4k-instruct), but it does not specify any programming languages, libraries, or other software with their version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | The hyperparameters used for fine-tuning different LLMs are provided in Appendix A.4.2. ... Table 6 shows the hyperparameter settings for each aligned LLM in the Normal, Implicit, and Backdoor attack scenarios. LLaMA-3-8B-Instruct ... learning rate 1e-4, training epochs 3, batch size 4, lr warmup steps 100 ... In the harmful data attack scenario, the initial learning rate for the tested aligned LLMs is consistently set to 1e-5. |
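The three-step localization algorithm quoted in the Pseudocode row (cosine-similarity gap, over-rejection count Ro, parameter up-scaling by α > 1) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the per-layer hidden-state inputs, the `gap_threshold` value, and the function names are all assumptions made for the sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two hidden-state vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def locate_safety_layers(normal_states, malicious_states, gap_threshold=0.1):
    """Step 1 (sketch): per layer, measure the cosine-similarity gap between
    hidden states for normal vs. malicious queries, and return the range
    [i, j] running from where the gap first appears to where it first
    smooths out again (or the last layer if it never does)."""
    gaps = [1.0 - cosine(n, m) for n, m in zip(normal_states, malicious_states)]
    i = next(l for l, g in enumerate(gaps) if g > gap_threshold)
    j = next((l for l in range(i, len(gaps)) if gaps[l] <= gap_threshold),
             len(gaps) - 1)
    return i, j

def upscale_layers(layer_params, i, j, alpha=1.2):
    """Step 3 (sketch): up-scale the parameters of layers i..j by α > 1."""
    return [alpha * p if i <= l <= j else p for l, p in enumerate(layer_params)]
```

Step 2 of the quoted algorithm, counting refusals Ro over the over-rejection dataset Do, is an inference loop with a refusal check and is omitted here since it depends on the serving stack.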
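The hyperparameters quoted in the Experiment Setup and Dataset Splits rows can be collected into a configuration fragment. This is a sketch assembled from the reported Appendix A.4.2 values: the dictionary keys and the `harmful_mix` helper are illustrative names, and `p` is the malicious-data ratio from the harmful fine-tuning setup of Huang et al. (2024c).

```python
# Fine-tuning hyperparameters as reported for LLaMA-3-8B-Instruct in the
# Normal/Implicit/Backdoor scenarios (key names are illustrative).
FINETUNE_CONFIG = {
    "learning_rate": 1e-4,
    "training_epochs": 3,
    "batch_size": 4,
    "lr_warmup_steps": 100,
}

# Harmful data attack scenario: initial learning rate fixed at 1e-5.
HARMFUL_ATTACK_LR = 1e-5

def harmful_mix(p):
    """Sample counts for the harmful-attack fine-tuning set:
    1,000 normal samples plus 1,000 * p malicious samples."""
    return 1000, int(1000 * p)
```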