Safety Alignment Can Be Not Superficial With Explicit Safety Signals

Authors: Jianwei Li, Jung-Eun Kim

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section describes the experimental setup of the main experiments first, including the base models, datasets, evaluation benchmarks, metrics, hyperparameter settings, and compared baselines.
Researcher Affiliation | Academia | Department of Computer Science, North Carolina State University, Raleigh, USA. Correspondence to: Jung-Eun Kim <EMAIL>.
Pseudocode | No | The paper describes methods in prose and visually through figures, but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We have code implementation and other information on the project website: https://sa-ess.github.io/.
Open Datasets | Yes | For the pretraining phase, we use the Wikipedia dataset and train the base model for three epochs (Foundation, 2024). Labels for the safety-related binary classification task are generated using Llama3-Guard. For the finetuning phase, we construct a dataset from Lima, Alpaca, and Alert: all samples from Alert are used as malicious queries and all samples from Lima as benign samples; to ensure balance, we sample additional benign queries from Alpaca.
Dataset Splits | No | For the finetuning phase, we construct a balanced dataset by sampling an equal number of benign and malicious samples from existing alignment datasets. The resulting dataset contains 29,600 samples, evenly split between benign (positive) and malicious (negative) queries. While this describes the composition, specific training, validation, and test splits (e.g., 80/10/10) are not explicitly provided for reproducibility.
Hardware Specification | Yes | All experiments presented in this paper were carried out on a single machine configured with three NVIDIA A6000 GPUs to handle the computationally intensive tasks, 256GB of memory to accommodate large-scale data processing and model training requirements, and 16 CPU cores to manage auxiliary operations.
Software Dependencies | No | Our codebase is built upon the Llama-Cookbook repository, serving as the foundation for implementing and evaluating our proposed methods (Meta, 2024). For the DPO models trained in our experiments, we adhered to the default settings provided by LLM-Factory to ensure consistency and comparability with prior work (Zheng et al., 2024). However, specific version numbers for these or other software dependencies are not provided.
Experiment Setup | Yes | Key Hyperparameters and Training Settings. While users are free to explore different hyperparameter configurations as long as the model converges, we provide the following recommendations based on our experiments: (1) Learning Rate: For base models, we recommend a larger learning rate, such as 2e-5, whereas for aligned models, a smaller learning rate, such as 1e-6, is preferred. (2) Training Epochs: We trained the base model for 15 epochs and the aligned model for 8 epochs. (3) Batch Size: A batch size of 72 was used, configured as 3 (number of devices) × 4 (per-device batch size) × 6 (gradient accumulation steps). (4) Sequence Length: The max sequence length was set to 2048. (5) Other Parameters: Parameters such as the optimizer and warmup steps were not extensively tuned, and users are encouraged to try different configurations. For the hyperparameters in our approach (λ1 & λ2 in Sec. 3.1; r1, r2, & r3 in Sec. 3.2; τ in Sec. 3.3), we empirically adopt the following: r1 = r2 = r3 = 10, λ1 = 0.01, λ2 = 0.1/0.01, and τ = 3.
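The balanced finetuning set described in the Open Datasets and Dataset Splits rows (all Alert queries as malicious negatives, all Lima samples as benign positives, topped up with benign Alpaca queries until the classes match) can be sketched as below. This is a minimal illustration assuming plain Python lists of query strings; the function name and input format are hypothetical, not the authors' actual pipeline.

```python
import random

def build_balanced_set(alert_queries, lima_samples, alpaca_samples, seed=0):
    """Hypothetical sketch of the balanced finetuning set: Alert queries are
    labeled malicious (0), Lima samples benign (1), and additional benign
    queries are sampled from Alpaca until the two classes are equal in size."""
    rng = random.Random(seed)
    negatives = [(q, 0) for q in alert_queries]   # malicious queries
    positives = [(q, 1) for q in lima_samples]    # benign samples
    shortfall = len(negatives) - len(positives)
    if shortfall > 0:
        # top up benign side from Alpaca to restore class balance
        positives += [(q, 1) for q in rng.sample(alpaca_samples, shortfall)]
    data = positives + negatives
    rng.shuffle(data)
    return data
```

Under this sketch, a resulting set of 29,600 samples would contain 14,800 of each class, matching the even split the review notes.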
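The training settings in the Experiment Setup row can be collected into a single configuration sketch, which also makes the effective batch size arithmetic (devices × per-device batch × gradient-accumulation steps = 72) explicit. The dictionary keys here are illustrative, not the authors' actual config names.

```python
# Reported batch-size decomposition: 3 devices x 4 per device x 6 accumulation steps.
NUM_DEVICES = 3
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 6
effective_batch = NUM_DEVICES * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 72, as reported

# Hedged summary of the recommended settings; key names are assumptions.
config = {
    "learning_rate": 2e-5,   # base models (1e-6 recommended for aligned models)
    "epochs": 15,            # base model (8 epochs for the aligned model)
    "max_seq_len": 2048,
    "lambda1": 0.01,         # λ1 in Sec. 3.1
    "lambda2": 0.1,          # λ2 in Sec. 3.1; paper reports 0.1/0.01
    "r1": 10, "r2": 10, "r3": 10,  # Sec. 3.2
    "tau": 3,                # τ in Sec. 3.3
}
```

Keeping the three batch factors separate, rather than hard-coding 72, makes it easy to rescale gradient accumulation when fewer GPUs are available while preserving the effective batch size.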