Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Authors: Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda, Daniel Khashabi, Ben Van Durme
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Section 6 (Experiments and Empirical Findings): On CoSAlign-Test (Table 3), applying CoSAlign to LLAMA3.1-8B-INSTRUCT and its SFT variant significantly improves controllability, measured by CoSA-Score, over the respective base models. The proposed CoSAlign method significantly outperforms all baselines in overall CoSA-Score, including strong cascade methods that use a GPT-4o evaluator to filter out unsafe responses. |
| Researcher Affiliation | Collaboration | Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme. Microsoft Responsible AI Research; Johns Hopkins University. Work done during Jingyu Zhang's internship at Microsoft. Correspondence to Jingyu Zhang {EMAIL} and Ahmed Elgohary {EMAIL}. |
| Pseudocode | Yes | Algorithm 1: CoSAlign response generation, error-scoring mechanism, and response pairing |
| Open Source Code | No | Project page: https://aka.ms/controllable-safety-alignment |
| Open Datasets | Yes | We use the BeaverTails dataset sourced from https://github.com/PKU-Alignment/BeaverTails with Apache-2.0 license, and the WildGuardMix dataset sourced from https://huggingface.co/datasets/allenai/wildguardmix with ODC-By license. |
| Dataset Splits | Yes | A.8 COSALIGN-TEST CONSTRUCTION: We provide the breakdown of test prompt categories as follows, with the number of prompts specified in parentheses. Seen configs: Test config: no risk allowed. Allowed prompts (100): * No risk (100 prompts). Disallowed prompts (300): |
| Hardware Specification | Yes | All experiments are conducted with 4 NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software components like GPT-4o model, LLAMA3.1-8B-INSTRUCT, lm-evaluation-harness codebase, and Llama-Guard-3-8B, but does not provide specific version numbers for these or for core programming languages/libraries (e.g., Python, PyTorch, CUDA) required to replicate the experiment. |
| Experiment Setup | Yes | We choose hyperparameters α = 0.1, β = 3, γ = 1 to ensure α < γ < β. During training, we conduct SFT and DPO with the RMSProp optimizer and a learning rate of 5e-7, with DPO β = 0.1. |
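The hyperparameters quoted in the Experiment Setup row can be collected into a small sketch. Only the values and the ordering constraint α < γ < β come from the paper; the variable names and dictionary layout below are assumptions for illustration, not the authors' actual configuration code.

```python
# Scoring-weight hyperparameters reported in the paper; the key names
# are assumed for illustration, only the values are from the paper.
COSALIGN_WEIGHTS = {
    "alpha": 0.1,  # α
    "beta": 3.0,   # β
    "gamma": 1.0,  # γ
}

# Training hyperparameters reported in the paper (SFT and DPO stages).
TRAINING_HPARAMS = {
    "optimizer": "RMSProp",
    "learning_rate": 5e-7,
    "dpo_beta": 0.1,  # DPO's own β, distinct from the scoring weight above
}

def satisfies_ordering(w: dict) -> bool:
    """Check the constraint α < γ < β stated in the paper."""
    return w["alpha"] < w["gamma"] < w["beta"]

print(satisfies_ordering(COSALIGN_WEIGHTS))  # True: 0.1 < 1.0 < 3.0
```

Note that DPO's β = 0.1 is a separate knob (the KL-regularization strength in DPO) and is unrelated to the scoring weight β = 3 used in the error-scoring mechanism.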