Improving LLM Safety Alignment with Dual-Objective Optimization
Authors: Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed alignment methods on two open-source LLMs, Gemma-2-2B (Team, 2024a) and Llama3-8B (Team, 2024b), across safety and utility metrics. Our empirical evaluations demonstrate that DOOR and W-DOOR significantly enhance resilience against a variety of jailbreak techniques. Extensive testing reveals substantial reductions in attack success rates, particularly in prefilling and suffix-based adversarial settings. Furthermore, our training methodology exhibits strong generalization capabilities, maintaining robustness across both in-distribution and out-of-distribution safety scenarios. |
| Researcher Affiliation | Academia | University of California, Berkeley. Correspondence to: Xuandong Zhao <EMAIL>, Will Cai <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode or algorithm blocks. Figure 1 illustrates the framework, but it is a diagram, not pseudocode. |
| Open Source Code | Yes | The code is available at https://github.com/wicai24/DOOR-Alignment. |
| Open Datasets | Yes | Training Data. Our safety alignment dataset consists of (1) safe data with desirable responses, (2) harmful data with undesirable responses, and (3) general utility data. Safety-related data comes from SORRY-Bench (Xie et al., 2024a) and HEx-PHI (Qi et al., 2023), covering diverse harmful instructions. ... Utility data is sampled from Alpaca (Taori et al., 2023). Evaluation Data. We assess safety using: (1) SORRY-Bench (held-out subset)... (3) HarmBench (Mazeika et al., 2024)... Over-conservatism is measured using XSTest (Röttger et al., 2024), while general capabilities are evaluated with MMLU (Hendrycks et al., 2021) and HellaSwag (Zellers et al., 2019). |
| Dataset Splits | Yes | To construct the training set, we first fine-tuned a separate model to generate undesirable (harmful) responses. This jailbroken model was trained using a subset of 110 samples from HEx-PHI (10 samples per category)... Subsequently, we generated desirable (safe) and undesirable (harmful) responses for a subset of the evaluation data from SORRY-Bench (180 samples, 4 per category) and HEx-PHI (220 samples, 20 per category). ...We randomly sampled 400 examples from the cleaned version of Alpaca to represent general utility data... From the original 180 SORRY-Bench evaluation set, we curated approximately 100 multi-turn harmful interactions for each model... HarmBench [...] contains 400 harmful behaviors... We measure the Over-Rejection Rate on 350 safe queries from XSTest. |
| Hardware Specification | Yes | All models are trained for 10 epochs on NVIDIA H100 GPUs with a batch size of 2... |
| Software Dependencies | No | The paper mentions 'AdamW' as the optimizer and 'bfloat16 precision' but does not specify software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or their version numbers. |
| Experiment Setup | Yes | Training Setup. All models are trained for 10 epochs on NVIDIA H100 GPUs with a batch size of 2, gradient accumulation of 1, and a learning rate of 1e-5. We use AdamW with bfloat16 precision and a sequence length of 512. For alignment methods, we set β = 0.5 and α = 0.2, except for SFT, which does not use β. |
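For concreteness, the reported hyperparameters can be collected into a single configuration object. This is a minimal illustrative sketch assuming a standard PyTorch-style fine-tuning loop; the class and field names are hypothetical and not taken from the released DOOR-Alignment code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the paper's Experiment Setup row."""
    epochs: int = 10
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 1
    learning_rate: float = 1e-5
    optimizer: str = "AdamW"
    precision: str = "bfloat16"
    max_seq_length: int = 512
    beta: float = 0.5   # β; not used by plain SFT
    alpha: float = 0.2  # α

    def effective_batch_size(self, num_devices: int = 1) -> int:
        """Examples consumed per optimizer step."""
        return (self.per_device_batch_size
                * self.gradient_accumulation_steps
                * num_devices)


cfg = TrainConfig()
print(cfg.effective_batch_size())  # 2 on a single GPU
```

With batch size 2 and no gradient accumulation, each optimizer step sees only 2 examples per GPU, which is consistent with fine-tuning 2B–8B models at sequence length 512 on H100s in bfloat16.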