Programming Refusal with Conditional Activation Steering

Authors: Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test the conditional activation steering performance on 500 unseen Alpaca (harmless) and 450 unseen Sorry-Bench (harmful) test sets. The results are presented in Figure 1 with a subset of the data in Table 2.
Researcher Affiliation | Collaboration | Bruce W. Lee¹, Inkit Padhi², Karthikeyan Natesan Ramamurthy², Erik Miehling², Pierre Dognin², Manish Nagireddy², Amit Dhurandhar²; ¹University of Pennsylvania, ²IBM Research
Pseudocode | Yes | Pseudocode for the harmful prompt generation: ... Pseudocode for the fine-grained category generation: ... # As implemented in the replication version of our open-source code. def find_best_condition_point(positive_strings, negative_strings, condition_vector, layer_range, max_layers_to_combine, threshold_range, threshold_step):
Open Source Code | Yes | We release an open-source implementation of our framework at github.com/IBM/activation-steering. ... Codebase: We release a general-purpose activation steering toolkit with demo datasets for the broader activation engineering community at github.com/IBM/activation-steering.
Open Datasets | Yes | sorrybench: sorry-bench/sorry-bench-202406 <b34822276edde97592eda99c0b56d306f8830469>; alpaca: EdBerg/yahmaalpaca-cleaned <6b6ff0e894d31390fa3581bf56f3bafaed9d5e2d>; refusal classifier: protectai/distilroberta-base-rejection-v1 <65584967c3f22ff7723e5370c65e0e76791e6055>
Dataset Splits | Yes | We then split this dataset into 700 prompts per category for training and 500 per category for testing.
Hardware Specification | Yes | CPU: 2 x AMD EPYC 7763 64-Core Processor ... GPU: NVIDIA A100-SXM4-80GB
Software Dependencies | Yes | Python Version: 3.10.5; PyTorch: 2.3.0; Transformers: 4.43.3
Experiment Setup | Yes | For the condition vector, we use a grid search (Appendix C.2) algorithm that determines the best threshold, layer, and comparison direction (> or <). ... The primary hyperparameters for conditioning can be conceptualized in a statement: Steer when the {best threshold} is {best direction} than the cosine similarity at {best layer}.
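
The conditioning rule quoted under Experiment Setup ("Steer when the {best threshold} is {best direction} than the cosine similarity at {best layer}") and the grid search over threshold, layer, and direction can be illustrated with a minimal sketch. This is not the authors' `find_best_condition_point`: the function names, the use of F1 as the selection score, the restriction to single layers (no `max_layers_to_combine`), and the toy similarity values are all assumptions made for illustration. The sketch also assumes the per-layer cosine similarities have already been computed for each prompt.

```python
from itertools import product


def should_steer(similarity, threshold, direction):
    """Apply the rule: steer when `threshold` is `direction` than `similarity`.

    `direction` is either "larger" or "smaller" (hypothetical labels).
    """
    if direction == "larger":
        return threshold > similarity
    return threshold < similarity


def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when nothing is predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def grid_search_condition(pos_sims, neg_sims, layers, thresholds):
    """Exhaustively search (layer, threshold, direction).

    pos_sims / neg_sims: one list per prompt, indexed by layer, holding the
    precomputed cosine similarity at that layer. Positive prompts are the
    ones that should trigger steering. Scoring by F1 is an assumption.
    Returns (best_f1, best_layer, best_threshold, best_direction).
    """
    best = (-1.0, None, None, None)
    for layer, threshold, direction in product(
        layers, thresholds, ("larger", "smaller")
    ):
        tp = sum(should_steer(s[layer], threshold, direction) for s in pos_sims)
        fp = sum(should_steer(s[layer], threshold, direction) for s in neg_sims)
        fn = len(pos_sims) - tp
        score = f1_score(tp, fp, fn)
        if score > best[0]:  # keep the first configuration on ties
            best = (score, layer, threshold, direction)
    return best


# Toy data (made up): layer 0 is uninformative, layer 1 separates the classes.
pos_sims = [[0.5, 0.1], [0.5, 0.2]]  # prompts that should trigger steering
neg_sims = [[0.5, 0.8], [0.5, 0.9]]  # prompts that should not
result = grid_search_condition(pos_sims, neg_sims, [0, 1], [0.25, 0.5, 0.75])
print(result)  # (1.0, 1, 0.25, 'larger')
```

On the toy data the search settles on layer 1 with threshold 0.25 and direction "larger", which reads as: steer when 0.25 is larger than the cosine similarity at layer 1. Because ties keep the first configuration found, the ordering of `layers` and `thresholds` affects which of several equally good settings is returned.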