Learning Safety Constraints for Large Language Models
Authors: Xin Chen, Yarden As, Andreas Krause
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of having an explicit geometric model for safety. Analysis of the learned polytope facets reveals emergent specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs' representation space. |
| Researcher Affiliation | Academia | 1Department of Computer Science, ETH Zürich, Zürich, Switzerland. Correspondence to: Xin Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SafeFlow: Representation Steering for Safe Response Generation |
| Open Source Code | Yes | Our code is publicly available at https://github.com/lasgroup/SafetyPolytope. |
| Open Datasets | Yes | We evaluate the steering capability of Algorithm 1 on HarmBench (Mazeika et al., 2024), a standardized benchmark for assessing LLM safety against adversarial attacks. [...] To analyze how well the facets capture different safety concepts, we use the BeaverTails dataset (Ji et al., 2023), which contains 330k annotated sentences in 14 safety categories. |
| Dataset Splits | Yes | (ii) for these methods, we collect model features from 80% of their attack strings for training, reserving 20% for testing; |
| Hardware Specification | No | No specific hardware details such as GPU or CPU models were provided in the paper. The text mentions 'existing tools for vectorized computation (Bradbury et al., 2018; Agrawal et al., 2019; Blondel et al., 2022; Lu et al., 2024)' and refers to model quantization (e.g., '16-bit precision (float16 for Llama-2 and bfloat16 for Ministral and Qwen)'), but lacks concrete hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software components and libraries like 'Adam optimizer' and 'scikit-learn', but does not provide specific version numbers for these or any other key software dependencies. |
| Experiment Setup | Yes | For the polytope training phase, we use the Adam optimizer with a learning rate of 10^-2 and batch size of 128. The feature extractor projects hidden states to a 16,384-dimensional space followed by ReLU activation. The loss function uses an entropy weight of 1.0 and a feature L1 regularization weight λϕ = 1.0. The margin parameter κ varies across model architectures: 60.0 for Llama-2 7B, 5.0 for Ministral 8B, and 30.0 for Qwen2 1.5B, reflecting different geometric requirements in their respective representation spaces. During the steering phase, we apply hidden state optimization at layer 20 with model-specific configurations. For Llama-2 7B, we set λunsafe = 4.0 and λsafe = 10^-4 for the optimization objective. Ministral 8B uses λunsafe = 0.25 without a safety violation penalty (λsafe = 0). For Qwen2 1.5B, we set λunsafe = 10.0 and λsafe = 5000. |
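The training configuration in the last row can be sketched as follows. This is a minimal numpy sketch, not the authors' implementation: the facet parameterization and the hinge-style margin loss are assumptions for illustration, the dimensions are reduced from the reported 16,384 to keep the sketch fast, and all names (`features`, `polytope_margin_loss`, `W_feat`) are hypothetical. Only the hyperparameter values (learning rate 10^-2, batch size 128, ReLU feature extractor, margin κ, L1 weight λϕ) come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 64    # stand-in for the LLM hidden size
FEATURE_DIM = 128  # paper reports 16,384; reduced here for the sketch
NUM_FACETS = 8     # number of polytope facets (illustrative choice)
KAPPA = 5.0        # margin parameter (model-specific in the paper)
LAMBDA_PHI = 1.0   # feature L1 regularization weight from the paper
LR = 1e-2          # Adam learning rate from the paper
BATCH = 128        # batch size from the paper

# Feature extractor: linear projection followed by ReLU, as described.
W_feat = rng.normal(scale=0.1, size=(HIDDEN_DIM, FEATURE_DIM))

def features(h):
    """Project hidden states to the feature space and apply ReLU."""
    return np.maximum(h @ W_feat, 0.0)

def polytope_margin_loss(phi, labels, facets, offsets):
    """Assumed hinge-style objective: safe points (label 0) should lie
    inside every facet by margin KAPPA; unsafe points (label 1) should
    violate at least one facet by margin KAPPA."""
    scores = phi @ facets + offsets  # (batch, num_facets)
    worst = scores.max(axis=1)       # most-violated facet per point
    safe = labels == 0
    loss_safe = np.maximum(0.0, worst[safe] + KAPPA).mean() if safe.any() else 0.0
    loss_unsafe = np.maximum(0.0, KAPPA - worst[~safe]).mean() if (~safe).any() else 0.0
    reg = LAMBDA_PHI * np.abs(phi).mean()  # L1 penalty on features
    return loss_safe + loss_unsafe + reg

# One synthetic batch, in place of extracted LLM hidden states.
h = rng.normal(size=(BATCH, HIDDEN_DIM))
labels = rng.integers(0, 2, size=BATCH)
facets = rng.normal(scale=0.1, size=(FEATURE_DIM, NUM_FACETS))
offsets = np.zeros(NUM_FACETS)

loss = polytope_margin_loss(features(h), labels, facets, offsets)
print(f"initial polytope loss: {loss:.3f}")
```

In an actual run, `facets`, `offsets`, and `W_feat` would be optimized with Adam at the reported learning rate; the sketch only evaluates the loss once to show how the reported hyperparameters fit together.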