Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Authors: Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda, Daniel Khashabi, Ben Van Durme

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6 EXPERIMENTS AND EMPIRICAL FINDINGS: On CoSAlign-Test (Table 3), applying CoSAlign to LLAMA3.1-8B-INSTRUCT and its SFT variant significantly improves controllability, measured by CoSA-Score, over the respective base models. Our proposed CoSAlign method significantly outperforms all baselines, including strong cascade methods that use a GPT-4o evaluator to filter out unsafe responses, in terms of overall CoSA-Score.
Researcher Affiliation | Collaboration | Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme; Microsoft Responsible AI Research; Johns Hopkins University. Work done during Jingyu Zhang's internship at Microsoft. Correspondence to Jingyu Zhang {EMAIL} and Ahmed Elgohary {EMAIL}.
Pseudocode | Yes | Algorithm 1: CoSAlign response generation, error-scoring mechanism, and response pairing
Open Source Code | No | Project page: https://aka.ms/controllable-safety-alignment
Open Datasets | Yes | We use the BeaverTails dataset sourced from https://github.com/PKU-Alignment/BeaverTails with Apache-2.0 license, and the WildGuardMix dataset sourced from https://huggingface.co/datasets/allenai/wildguardmix with ODC-By license.
Dataset Splits | Yes | A.8 COSALIGN-TEST CONSTRUCTION: We provide the breakdown of test prompt categories as follows, with the number of prompts specified in parentheses. Seen configs: Test config: no risk allowed. Allowed prompts (100): * No risk (100 prompts). Disallowed prompts (300):
Hardware Specification | Yes | All experiments are conducted with 4 NVIDIA A100 80GB GPUs.
Software Dependencies | No | The paper mentions software components such as the GPT-4o model, LLAMA3.1-8B-INSTRUCT, the lm-evaluation-harness codebase, and Llama-Guard-3-8B, but does not provide specific version numbers for these, or for the core programming languages and libraries (e.g., Python, PyTorch, CUDA) required to replicate the experiments.
Experiment Setup | Yes | We choose hyperparameters α = 0.1, β = 3, γ = 1 to ensure α < γ < β. During training, we conduct SFT and DPO with the RMSProp optimizer and a learning rate of 5e-7, with DPO β = 0.1.
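The reported setup can be captured in a small config sketch. This is a hedged illustration: only the numeric values come from the paper; the variable and key names below are assumptions, not identifiers from the authors' code.

```python
# Hyperparameter values reported in the paper (Experiment Setup row above).
# Key names are illustrative assumptions for this sketch.
cosalign_scoring = {
    "alpha": 0.1,  # paper requires alpha < gamma < beta
    "gamma": 1,
    "beta": 3,
}
training = {
    "optimizer": "RMSprop",
    "learning_rate": 5e-7,  # used for both SFT and DPO stages
    "dpo_beta": 0.1,        # DPO's beta (KL-regularization strength)
}

# Sanity check on the ordering constraint stated in the paper.
assert cosalign_scoring["alpha"] < cosalign_scoring["gamma"] < cosalign_scoring["beta"]
```

Keeping the ordering constraint as an explicit assertion makes it easy to catch invalid scoring configurations when sweeping these values.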