Safety Alignment Can Be Not Superficial With Explicit Safety Signals

Authors: Jianwei Li, Jung-Eun Kim

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section describes the experimental setup of the main experiments first, including the base models, datasets, evaluation benchmarks, metrics, hyperparameter settings, and compared baselines.
Researcher Affiliation | Academia | Department of Computer Science, North Carolina State University, Raleigh, USA. Correspondence to: Jung-Eun Kim <EMAIL>.
Pseudocode | No | The paper describes methods in prose and visually through figures, but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We have code implementation and other information on the project website: https://sa-ess.github.io/.
Open Datasets | Yes | For the pretraining phase, we use the Wikipedia dataset and train the base model for three epochs (Foundation, 2024). Labels for the safety-related binary classification task are generated using Llama3-Guard. For the finetuning phase, we construct a dataset from Lima, Alpaca, and Alert: all samples from Alert are used as malicious queries and all samples from Lima as benign samples; to ensure balance, we sample additional benign queries from Alpaca.
Dataset Splits | No | For the finetuning phase, we construct a balanced dataset by sampling an equal number of benign and malicious samples from existing alignment datasets. The resulting dataset contains 29,600 samples, evenly split between benign (positive) and malicious (negative) queries. While this describes the composition, specific training, validation, and test splits (e.g., 80/10/10) are not explicitly provided for reproducibility.
Hardware Specification | Yes | All experiments presented in this paper were carried out on a single machine configured with three NVIDIA A6000 GPUs to handle the computationally intensive tasks, 256GB of memory to accommodate large-scale data processing and model training requirements, and 16 CPU cores to manage auxiliary operations.
Software Dependencies | No | Our codebase is built upon the Llama-Cookbook repository, serving as the foundation for implementing and evaluating our proposed methods (Meta, 2024). For the DPO models trained in our experiments, we adhered to the default settings provided by LLM-Factory to ensure consistency and comparability with prior work (Zheng et al., 2024). However, specific version numbers for these or other software dependencies are not provided.
Experiment Setup | Yes | Key Hyperparameters and Training Settings. While users are free to explore different hyperparameter configurations as long as the model converges, we provide the following recommendations based on our experiments: (1) Learning Rate: For base models, we recommend a larger learning rate, such as 2e-5, whereas for aligned models, a smaller learning rate, such as 1e-6, is preferred. (2) Training Epochs: We trained the base model for 15 epochs and the aligned model for 8 epochs. (3) Batch Size: A batch size of 72 was used, configured as 3 (number of devices) × 4 (per-device batch size) × 6 (gradient accumulation steps). (4) Sequence Length: The max sequence length was set to 2048. (5) Other Parameters: Parameters such as the optimizer and warmup steps were not extensively tuned, and users are encouraged to try different configurations. For the hyperparameters in our approach (λ1 & λ2 in Sec. 3.1; r1, r2, & r3 in Sec. 3.2; τ in Sec. 3.3), we empirically adopt the following: r1 = r2 = r3 = 10, λ1 = 0.01, λ2 = 0.1/0.01, and τ = 3.
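The balanced finetuning set described in the Open Datasets and Dataset Splits rows (all Alert queries as malicious negatives, all Lima samples as benign positives, topped up with benign Alpaca queries until the classes match) can be sketched as below. This is a minimal illustration assuming plain Python lists of query strings; the function name and input format are hypothetical, not the authors' actual pipeline.

```python
import random

def build_balanced_set(alert_queries, lima_samples, alpaca_samples, seed=0):
    """Hypothetical sketch of the balanced finetuning set: Alert queries are
    labeled malicious (0), Lima samples benign (1), and additional benign
    queries are sampled from Alpaca until the two classes are equal in size."""
    rng = random.Random(seed)
    negatives = [(q, 0) for q in alert_queries]   # malicious queries
    positives = [(q, 1) for q in lima_samples]    # benign samples
    shortfall = len(negatives) - len(positives)
    if shortfall > 0:
        # top up benign side from Alpaca to restore class balance
        positives += [(q, 1) for q in rng.sample(alpaca_samples, shortfall)]
    data = positives + negatives
    rng.shuffle(data)
    return data
```

Under this sketch, a resulting set of 29,600 samples would contain 14,800 of each class, matching the even split the review notes.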
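The training settings in the Experiment Setup row can be collected into a single configuration sketch, which also makes the effective batch size arithmetic (devices × per-device batch × gradient-accumulation steps = 72) explicit. The dictionary keys here are illustrative, not the authors' actual config names.

```python
# Reported batch-size decomposition: 3 devices x 4 per device x 6 accumulation steps.
NUM_DEVICES = 3
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 6
effective_batch = NUM_DEVICES * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 72, as reported

# Hedged summary of the recommended settings; key names are assumptions.
config = {
    "learning_rate": 2e-5,   # base models (1e-6 recommended for aligned models)
    "epochs": 15,            # base model (8 epochs for the aligned model)
    "max_seq_len": 2048,
    "lambda1": 0.01,         # λ1 in Sec. 3.1
    "lambda2": 0.1,          # λ2 in Sec. 3.1; paper reports 0.1/0.01
    "r1": 10, "r2": 10, "r3": 10,  # Sec. 3.2
    "tau": 3,                # τ in Sec. 3.3
}
```

Keeping the three batch factors separate, rather than hard-coding 72, makes it easy to rescale gradient accumulation when fewer GPUs are available while preserving the effective batch size.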