Learning Safety Constraints for Large Language Models

Authors: Xin Chen, Yarden As, Andreas Krause

ICML 2025

Reproducibility (Variable: Result. LLM Response)
Research Type: Experimental. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of an explicit geometric model for safety. Analysis of the learned polytope facets reveals the emergence of specialization in detecting different semantic notions of safety, providing interpretable insight into how safety is captured in LLMs' representation space.
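The "explicit geometric model" referenced here is a polytope of learned half-space constraints. A minimal sketch of facet-based safety checking follows; the facet normals G, offsets h, and sizes are toy stand-ins for the learned safety polytope, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_facets, dim = 4, 8                 # toy sizes, not the paper's
G = rng.normal(size=(n_facets, dim)) # one row per (hypothetical) learned facet
h = np.ones(n_facets)                # facet offsets

def facet_violations(z):
    """Signed slack of feature z against each facet: G z - h.

    A positive entry means that facet's half-space constraint is
    violated, i.e. that safety notion flags the input."""
    return G @ z - h

def is_safe(z):
    # z lies inside the polytope iff every facet constraint G z <= h holds.
    return bool(np.all(facet_violations(z) <= 0))

z_inside = np.zeros(dim)   # the origin satisfies G z = 0 <= h = 1
print(is_safe(z_inside))   # True
```

Facet specialization, as described above, would then correspond to different rows of G firing on different semantic categories of unsafe input.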
Researcher Affiliation: Academia. Department of Computer Science, ETH Zürich, Zürich, Switzerland. Correspondence to: Xin Chen <EMAIL>.
Pseudocode: Yes. Algorithm 1 (SafeFlow: Representation Steering for Safe Response Generation).
Open Source Code: Yes. Our code is publicly available at https://github.com/lasgroup/SafetyPolytope.
Open Datasets: Yes. We evaluate the steering capability of Algorithm 1 on HarmBench (Mazeika et al., 2024), a standardized benchmark for assessing LLM safety against adversarial attacks. [...] To analyze how well the facets capture different safety concepts, we use the BeaverTails dataset (Ji et al., 2023), which contains 330k annotated sentences across 14 safety categories.
Dataset Splits: Yes. (ii) for these methods, we collect model features from 80% of their attack strings for training, reserving 20% for testing;
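The quoted 80/20 protocol can be sketched as a seeded shuffle-and-cut. The split function and placeholder feature list below are illustrative only; the paper does not specify its splitting code:

```python
import numpy as np

def split_features(features, train_frac=0.8, seed=0):
    """Shuffle the collected features and cut them into train/test lists."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    cut = int(train_frac * len(features))
    return ([features[i] for i in idx[:cut]],   # 80% for training
            [features[i] for i in idx[cut:]])   # 20% held out for testing

attack_features = [f"feat_{i}" for i in range(10)]  # placeholder data
train, test = split_features(attack_features)
print(len(train), len(test))  # 8 2
```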
Hardware Specification: No. The paper provides no specific hardware details such as GPU or CPU models. It mentions 'existing tools for vectorized computation (Bradbury et al., 2018; Agrawal et al., 2019; Blondel et al., 2022; Lu et al., 2024)' and model quantization (e.g., '16-bit precision (float16 for Llama-2 and bfloat16 for Ministral and Qwen)'), but lacks concrete specifications for the hardware used to run the experiments.
Software Dependencies: No. The paper mentions software components and libraries such as the 'Adam optimizer' and 'scikit-learn', but does not provide version numbers for these or any other key software dependencies.
Experiment Setup: Yes. For the polytope training phase, we use the Adam optimizer with a learning rate of 10^-2 and a batch size of 128. The feature extractor projects hidden states to a 16,384-dimensional space followed by a ReLU activation. The loss function uses an entropy weight of 1.0 and a feature L1 regularization weight λϕ = 1.0. The margin parameter κ varies across model architectures: 60.0 for Llama-2 7B, 5.0 for Ministral 8B, and 30.0 for Qwen2 1.5B, reflecting different geometric requirements in their respective representation spaces. During the steering phase, we apply hidden-state optimization at layer 20 with model-specific configurations. For Llama-2 7B, we set λunsafe = 4.0 and λsafe = 10^4 for the optimization objective. Ministral 8B uses λunsafe = 0.25 without a safety-violation penalty (λsafe = 0). For Qwen2 1.5B, we set λunsafe = 10.0 and λsafe = 5000.
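The steering phase described above can be sketched as gradient descent on a hidden state: a hinge penalty weighted by λunsafe pushes the state off violated polytope facets, while a proximity term keeps it close to the original activation. The objective form, toy facets, and step sizes below are assumptions for illustration (the λsafe term is omitted for brevity); the paper's exact steering objective may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
G = rng.normal(size=(3, dim))  # toy facet normals (stand-ins for learned facets)
h = np.zeros(3)                # toy facet offsets

def steer(z0, lam_unsafe=4.0, lr=0.05, steps=200):
    """Nudge hidden state z0 toward the safe side of the facets G z <= h."""
    def objective(z):
        hinge = np.maximum(G @ z - h, 0.0).sum()   # total facet violation
        return lam_unsafe * hinge + 0.5 * np.sum((z - z0) ** 2)

    z, best = z0.copy(), z0.copy()
    for _ in range(steps):
        active = (G @ z - h > 0).astype(float)        # only violated facets push
        grad = lam_unsafe * G.T @ active + (z - z0)   # subgradient of objective
        z = z - lr * grad
        if objective(z) < objective(best):            # keep the best iterate
            best = z.copy()
    return best

z0 = rng.normal(size=dim)
z = steer(z0)
violation = lambda x: np.maximum(G @ x - h, 0.0).sum()
print(violation(z) <= violation(z0))  # True: steering never worsens violation
```

Because the best iterate by objective value is returned, the total facet violation of the steered state is never larger than that of the original state.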