Improving LLM Safety Alignment with Dual-Objective Optimization
Authors: Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed alignment methods on two open-source LLMs, Gemma-2-2B (Team, 2024a) and Llama3-8B (Team, 2024b), across safety and utility metrics. Our empirical evaluations demonstrate that DOOR and W-DOOR significantly enhance resilience against a variety of jailbreak techniques. Extensive testing reveals substantial reductions in attack success rates, particularly in prefilling and suffix-based adversarial settings. Furthermore, our training methodology exhibits strong generalization capabilities, maintaining robustness across both in-distribution and out-of-distribution safety scenarios. |
| Researcher Affiliation | Academia | University of California, Berkeley. Correspondence to: Xuandong Zhao <EMAIL>, Will Cai <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode or algorithm blocks. Figure 1 illustrates the framework, but it is a diagram, not pseudocode. |
| Open Source Code | Yes | The code is available at https://github.com/wicai24/DOOR-Alignment. |
| Open Datasets | Yes | Training Data. Our safety alignment dataset consists of (1) safe data with desirable responses, (2) harmful data with undesirable responses, and (3) general utility data. Safety-related data comes from SORRY-Bench (Xie et al., 2024a) and HEx-PHI (Qi et al., 2023), covering diverse harmful instructions. ... Utility data is sampled from Alpaca (Taori et al., 2023). Evaluation Data. We assess safety using: (1) SORRY-Bench (held-out subset)... (3) HarmBench (Mazeika et al., 2024)... Over-conservatism is measured using XSTest (Röttger et al., 2024), while general capabilities are evaluated with MMLU (Hendrycks et al., 2021) and HellaSwag (Zellers et al., 2019). |
| Dataset Splits | Yes | To construct the training set, we first fine-tuned a separate model to generate undesirable (harmful) responses. This jailbroken model was trained using a subset of 110 samples from HEx-PHI (10 samples per category)... Subsequently, we generated desirable (safe) and undesirable (harmful) responses for a subset of the evaluation data from SORRY-Bench (180 samples, 4 per category) and HEx-PHI (220 samples, 20 per category). ...We randomly sampled 400 examples from the cleaned version of Alpaca to represent general utility data... From the original 180 SORRY-Bench evaluation set, we curated approximately 100 multi-turn harmful interactions for each model... HarmBench [...] contains 400 harmful behaviors... We measure the Over-Rejection Rate on 350 safe queries from XSTest. |
| Hardware Specification | Yes | All models are trained for 10 epochs on NVIDIA H100 GPUs with a batch size of 2... |
| Software Dependencies | No | The paper mentions 'AdamW' as the optimizer and 'bfloat16 precision' but does not specify software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or their version numbers. |
| Experiment Setup | Yes | Training Setup. All models are trained for 10 epochs on NVIDIA H100 GPUs with a batch size of 2, gradient accumulation of 1, and a learning rate of 1e-5. We use AdamW with bfloat16 precision and a sequence length of 512. For alignment methods, we set β = 0.5 and α = 0.2, except for SFT, which does not use β. |
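For concreteness, the reported hyperparameters can be collected into a single configuration object. This is a minimal illustrative sketch assuming a standard PyTorch-style fine-tuning loop; the class and field names are hypothetical and not taken from the released DOOR-Alignment code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the paper's Experiment Setup row."""
    epochs: int = 10
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 1
    learning_rate: float = 1e-5
    optimizer: str = "AdamW"
    precision: str = "bfloat16"
    max_seq_length: int = 512
    beta: float = 0.5   # β; not used by plain SFT
    alpha: float = 0.2  # α

    def effective_batch_size(self, num_devices: int = 1) -> int:
        """Examples consumed per optimizer step."""
        return (self.per_device_batch_size
                * self.gradient_accumulation_steps
                * num_devices)


cfg = TrainConfig()
print(cfg.effective_batch_size())  # 2 on a single GPU
```

With batch size 2 and no gradient accumulation, each optimizer step sees only 2 examples per GPU, which is consistent with fine-tuning 2B–8B models at sequence length 512 on H100s in bfloat16.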