Programming Refusal with Conditional Activation Steering
Authors: Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the conditional activation steering performance on 500 unseen Alpaca (harmless) and 450 unseen Sorry-Bench (harmful) test sets. The results are presented in Figure 1 with a subset of the data in Table 2. |
| Researcher Affiliation | Collaboration | Bruce W. Lee¹, Inkit Padhi², Karthikeyan Natesan Ramamurthy², Erik Miehling², Pierre Dognin², Manish Nagireddy², Amit Dhurandhar²; ¹University of Pennsylvania, ²IBM Research |
| Pseudocode | Yes | Pseudocode for the harmful prompt generation: ... Pseudocode for the fine-grained category generation: ... `# As implemented in the replication version of our opensource code.` `def find_best_condition_point(positive_strings, negative_strings, condition_vector, layer_range, max_layers_to_combine, threshold_range, threshold_step):` |
| Open Source Code | Yes | We release an open-source implementation of our framework at github.com/IBM/activation-steering. ... Codebase: We release a general-purpose activation steering toolkit with demo datasets for the broader activation engineering community at github.com/IBM/activation-steering. |
| Open Datasets | Yes | sorrybench: sorry-bench/sorry-bench-202406 <b34822276edde97592eda99c0b56d306f8830469>; alpaca: EdBerg/yahmaalpaca-cleaned <6b6ff0e894d31390fa3581bf56f3bafaed9d5e2d>; refusal classifier: protectai/distilroberta-base-rejection-v1 <65584967c3f22ff7723e5370c65e0e76791e6055> |
| Dataset Splits | Yes | We then split this dataset into 700 prompts per category for training and 500 per category for testing. |
| Hardware Specification | Yes | CPU: 2 x AMD EPYC 7763 64-Core Processor ... GPU: NVIDIA A100-SXM4-80GB |
| Software Dependencies | Yes | Python Version: 3.10.5; PyTorch: 2.3.0; Transformers: 4.43.3 |
| Experiment Setup | Yes | For the condition vector, we use a grid search (Appendix C.2) algorithm that determines the best threshold, layer, and comparison direction (> or <). ... The primary hyperparameters for conditioning can be conceptualized in a statement: Steer when the {best threshold} is {best direction} than the cosine similarity at {best layer}. |
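
The conditioning rule quoted in the Experiment Setup row ("Steer when the {best threshold} is {best direction} than the cosine similarity at {best layer}") can be illustrated with a small grid search. The following is a minimal sketch and not the authors' implementation: the function names, the `"larger"`/`"smaller"` direction labels, the use of F1 as the selection score, and the toy similarity values are all assumptions made for illustration. It searches single layers only, so it omits the `max_layers_to_combine` option that appears in the released function signature, and it takes precomputed per-layer similarities as input rather than running a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def should_steer(similarity, threshold, direction):
    """Apply the rule: steer when the threshold is `direction` than the similarity.

    `direction` is "larger" (threshold > similarity) or
    "smaller" (threshold < similarity); these labels are hypothetical.
    """
    if direction == "larger":
        return threshold > similarity
    return threshold < similarity


def f1_score(predictions, labels):
    """F1 of boolean predictions against boolean labels (assumed selection metric)."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search_condition(positive_sims, negative_sims, thresholds):
    """Return (score, layer, threshold, direction) with the highest F1.

    positive_sims / negative_sims map a layer index to the similarities
    measured on prompts that should / should not trigger steering.
    Ties keep the first candidate found.
    """
    best = None
    for layer in sorted(positive_sims):
        sims = list(positive_sims[layer]) + list(negative_sims[layer])
        labels = [True] * len(positive_sims[layer]) + [False] * len(negative_sims[layer])
        for threshold in thresholds:
            for direction in ("larger", "smaller"):
                preds = [should_steer(s, threshold, direction) for s in sims]
                score = f1_score(preds, labels)
                if best is None or score > best[0]:
                    best = (score, layer, threshold, direction)
    return best


# Toy similarities (made up): layer 1 separates the two classes, layer 0 does not.
positive_sims = {0: [0.10, 0.20], 1: [0.01, 0.02]}
negative_sims = {0: [0.15, 0.12], 1: [0.05, 0.06]}
best = grid_search_condition(positive_sims, negative_sims, [0.0, 0.03, 0.1])
print(best)  # (1.0, 1, 0.03, 'larger')
```

With these toy values, the selected condition reads "steer when 0.03 is larger than the cosine similarity at layer 1". In practice the similarities would come from a model's hidden states compared against the condition vector, and the threshold grid would be built from a range and step as in the quoted signature.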