Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Authors: Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda, Daniel Khashabi, Ben Van Durme

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6 EXPERIMENTS AND EMPIRICAL FINDINGS: On CoSAlign-Test (Table 3), applying CoSAlign to LLAMA3.1-8B-INSTRUCT and its SFT variant significantly improves controllability, measured by CoSA-Score, over the respective base models. Our proposed CoSAlign method significantly outperforms all baselines, including strong cascade methods that use a GPT-4o evaluator to filter out unsafe responses, in terms of overall CoSA-Score.
Researcher Affiliation | Collaboration | Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme; Microsoft Responsible AI Research; Johns Hopkins University. Work done during Jingyu Zhang's internship at Microsoft. Correspondence to Jingyu Zhang {EMAIL} and Ahmed Elgohary {EMAIL}.
Pseudocode | Yes | Algorithm 1: CoSAlign response generation, error-scoring mechanism, and response pairing
Open Source Code | No | Project page: https://aka.ms/controllable-safety-alignment
Open Datasets | Yes | We use the BeaverTails dataset sourced from https://github.com/PKU-Alignment/BeaverTails with Apache-2.0 license, and the WildGuardMix dataset sourced from https://huggingface.co/datasets/allenai/wildguardmix with ODC-By license.
Dataset Splits | Yes | A.8 COSALIGN-TEST CONSTRUCTION: We provide the breakdown of test prompt categories as follows, with the number of prompts specified in parentheses. Seen configs: Test config: no risk allowed. Allowed prompts (100): * No risk (100 prompts). Disallowed prompts (300):
Hardware Specification | Yes | All experiments are conducted with 4 NVIDIA A100 80GB GPUs.
Software Dependencies | No | The paper mentions software components such as the GPT-4o model, LLAMA3.1-8B-INSTRUCT, the lm-evaluation-harness codebase, and Llama-Guard-3-8B, but does not provide specific version numbers for these, or for the core programming languages and libraries (e.g., Python, PyTorch, CUDA) required to replicate the experiments.
Experiment Setup | Yes | We choose hyperparameters α = 0.1, β = 3, γ = 1 to ensure α < γ < β. During training, we conduct SFT and DPO with the RMSProp optimizer and a learning rate of 5e-7, with DPO β = 0.1.
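The reported setup can be captured in a small config sketch. This is a hedged illustration: only the numeric values come from the paper; the variable and key names below are assumptions, not identifiers from the authors' code.

```python
# Hyperparameter values reported in the paper (Experiment Setup row above).
# Key names are illustrative assumptions for this sketch.
cosalign_scoring = {
    "alpha": 0.1,  # paper requires alpha < gamma < beta
    "gamma": 1,
    "beta": 3,
}
training = {
    "optimizer": "RMSprop",
    "learning_rate": 5e-7,  # used for both SFT and DPO stages
    "dpo_beta": 0.1,        # DPO's beta (KL-regularization strength)
}

# Sanity check on the ordering constraint stated in the paper.
assert cosalign_scoring["alpha"] < cosalign_scoring["gamma"] < cosalign_scoring["beta"]
```

Keeping the ordering constraint as an explicit assertion makes it easy to catch invalid scoring configurations when sweeping these values.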