Learning Safety Constraints for Large Language Models

Authors: Xin Chen, Yarden As, Andreas Krause

ICML 2025

Reproducibility (Variable: Result. LLM Response)
Research Type: Experimental. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of an explicit geometric model for safety. Analysis of the learned polytope facets reveals the emergence of specialization in detecting different semantic notions of safety, providing interpretable insight into how safety is captured in LLMs' representation space.
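The "explicit geometric model" referenced here is a polytope of learned half-space constraints. A minimal sketch of facet-based safety checking follows; the facet normals G, offsets h, and sizes are toy stand-ins for the learned safety polytope, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_facets, dim = 4, 8                 # toy sizes, not the paper's
G = rng.normal(size=(n_facets, dim)) # one row per (hypothetical) learned facet
h = np.ones(n_facets)                # facet offsets

def facet_violations(z):
    """Signed slack of feature z against each facet: G z - h.

    A positive entry means that facet's half-space constraint is
    violated, i.e. that safety notion flags the input."""
    return G @ z - h

def is_safe(z):
    # z lies inside the polytope iff every facet constraint G z <= h holds.
    return bool(np.all(facet_violations(z) <= 0))

z_inside = np.zeros(dim)   # the origin satisfies G z = 0 <= h = 1
print(is_safe(z_inside))   # True
```

Facet specialization, as described above, would then correspond to different rows of G firing on different semantic categories of unsafe input.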
Researcher Affiliation: Academia. Department of Computer Science, ETH Zürich, Zürich, Switzerland. Correspondence to: Xin Chen <EMAIL>.
Pseudocode: Yes. Algorithm 1 (SafeFlow: Representation Steering for Safe Response Generation).
Open Source Code: Yes. Our code is publicly available at https://github.com/lasgroup/SafetyPolytope.
Open Datasets: Yes. We evaluate the steering capability of Algorithm 1 on HarmBench (Mazeika et al., 2024), a standardized benchmark for assessing LLM safety against adversarial attacks. [...] To analyze how well the facets capture different safety concepts, we use the BeaverTails dataset (Ji et al., 2023), which contains 330k annotated sentences across 14 safety categories.
Dataset Splits: Yes. (ii) for these methods, we collect model features from 80% of their attack strings for training, reserving 20% for testing;
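The quoted 80/20 protocol can be sketched as a seeded shuffle-and-cut. The split function and placeholder feature list below are illustrative only; the paper does not specify its splitting code:

```python
import numpy as np

def split_features(features, train_frac=0.8, seed=0):
    """Shuffle the collected features and cut them into train/test lists."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    cut = int(train_frac * len(features))
    return ([features[i] for i in idx[:cut]],   # 80% for training
            [features[i] for i in idx[cut:]])   # 20% held out for testing

attack_features = [f"feat_{i}" for i in range(10)]  # placeholder data
train, test = split_features(attack_features)
print(len(train), len(test))  # 8 2
```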
Hardware Specification: No. The paper provides no specific hardware details such as GPU or CPU models. It mentions 'existing tools for vectorized computation (Bradbury et al., 2018; Agrawal et al., 2019; Blondel et al., 2022; Lu et al., 2024)' and model quantization (e.g., '16-bit precision (float16 for Llama-2 and bfloat16 for Ministral and Qwen)'), but lacks concrete specifications for the hardware used to run the experiments.
Software Dependencies: No. The paper mentions software components and libraries such as the 'Adam optimizer' and 'scikit-learn', but does not provide version numbers for these or any other key software dependencies.
Experiment Setup: Yes. For the polytope training phase, we use the Adam optimizer with a learning rate of 10^-2 and a batch size of 128. The feature extractor projects hidden states to a 16,384-dimensional space followed by a ReLU activation. The loss function uses an entropy weight of 1.0 and a feature L1 regularization weight λϕ = 1.0. The margin parameter κ varies across model architectures: 60.0 for Llama-2 7B, 5.0 for Ministral 8B, and 30.0 for Qwen2 1.5B, reflecting different geometric requirements in their respective representation spaces. During the steering phase, we apply hidden-state optimization at layer 20 with model-specific configurations. For Llama-2 7B, we set λunsafe = 4.0 and λsafe = 10^4 for the optimization objective. Ministral 8B uses λunsafe = 0.25 without a safety-violation penalty (λsafe = 0). For Qwen2 1.5B, we set λunsafe = 10.0 and λsafe = 5000.
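The steering phase described above can be sketched as gradient descent on a hidden state: a hinge penalty weighted by λunsafe pushes the state off violated polytope facets, while a proximity term keeps it close to the original activation. The objective form, toy facets, and step sizes below are assumptions for illustration (the λsafe term is omitted for brevity); the paper's exact steering objective may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
G = rng.normal(size=(3, dim))  # toy facet normals (stand-ins for learned facets)
h = np.zeros(3)                # toy facet offsets

def steer(z0, lam_unsafe=4.0, lr=0.05, steps=200):
    """Nudge hidden state z0 toward the safe side of the facets G z <= h."""
    def objective(z):
        hinge = np.maximum(G @ z - h, 0.0).sum()   # total facet violation
        return lam_unsafe * hinge + 0.5 * np.sum((z - z0) ** 2)

    z, best = z0.copy(), z0.copy()
    for _ in range(steps):
        active = (G @ z - h > 0).astype(float)        # only violated facets push
        grad = lam_unsafe * G.T @ active + (z - z0)   # subgradient of objective
        z = z - lr * grad
        if objective(z) < objective(best):            # keep the best iterate
            best = z.copy()
    return best

z0 = rng.normal(size=dim)
z = steer(z0)
violation = lambda x: np.maximum(G @ x - h, 0.0).sum()
print(violation(z) <= violation(z0))  # True: steering never worsens violation
```

Because the best iterate by objective value is returned, the total facet violation of the steered state is never larger than that of the original state.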