reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Programming Refusal with Conditional Activation Steering

Authors: Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test the conditional activation steering performance on 500 unseen Alpaca (harmless) and 450 unseen Sorry-Bench (harmful) test sets. The results are presented in Figure 1 with a subset of the data in Table 2.
Researcher Affiliation	Collaboration	Bruce W. Lee¹, Inkit Padhi², Karthikeyan Natesan Ramamurthy², Erik Miehling², Pierre Dognin², Manish Nagireddy², Amit Dhurandhar²; ¹University of Pennsylvania, ²IBM Research
Pseudocode	Yes	Pseudocode for the harmful prompt generation: ... Pseudocode for the fine-grained category generation: ... # As implemented in the replication version of our open-source code. def find_best_condition_point(positive_strings, negative_strings, condition_vector, layer_range, max_layers_to_combine, threshold_range, threshold_step):
Open Source Code	Yes	We release an open-source implementation of our framework at github.com/IBM/activation-steering. ... Codebase: We release a general-purpose activation steering toolkit with demo datasets for the broader activation engineering community at github.com/IBM/activation-steering.
Open Datasets	Yes	sorrybench: sorry-bench/sorry-bench-202406 <b34822276edde97592eda99c0b56d306f8830469> alpaca: EdBerg/yahmaalpaca-cleaned <6b6ff0e894d31390fa3581bf56f3bafaed9d5e2d> refusal classifier: protectai/distilroberta-base-rejection-v1 <65584967c3f22ff7723e5370c65e0e76791e6055>
Dataset Splits	Yes	We then split this dataset into 700 prompts per category for training and 500 per category for testing.
Hardware Specification	Yes	CPU: 2 x AMD EPYC 7763 64-Core Processor ... GPU: NVIDIA A100-SXM4-80GB
Software Dependencies	Yes	Python Version: 3.10.5 PyTorch: 2.3.0 Transformers: 4.43.3
Experiment Setup	Yes	For the condition vector, we use a grid search (Appendix C.2) algorithm that determines the best threshold, layer, and comparison direction (> or <). ... The primary hyperparameters for conditioning can be conceptualized in a statement: Steer when the {best threshold} is {best direction} than the cosine similarity at {best layer}.
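
The conditioning statement quoted in the Experiment Setup row ("Steer when the {best threshold} is {best direction} than the cosine similarity at {best layer}") can be illustrated with a small grid search. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`should_steer`, `f1_score`, `grid_search_condition`), the use of F1 as the selection metric, the "larger"/"smaller" direction labels, and the toy similarity values are all hypothetical. It assumes the per-layer cosine similarities between each prompt's hidden state and the condition vector have already been computed.

```python
from itertools import product


def should_steer(similarity, threshold, direction):
    """Return True when the threshold is `direction` than the similarity."""
    if direction == "larger":
        return threshold > similarity
    if direction == "smaller":
        return threshold < similarity
    raise ValueError(f"unknown direction: {direction!r}")


def f1_score(predictions, labels):
    """F1 over boolean predictions and labels (hypothetical selection metric)."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search_condition(positive_sims, negative_sims, thresholds):
    """Search layers, thresholds, and directions for the best-scoring condition.

    positive_sims / negative_sims map a layer index to the list of cosine
    similarities for prompts that should / should not trigger steering.
    Returns (score, layer, threshold, direction); ties keep the first hit.
    """
    best = None
    layers = sorted(positive_sims)
    for layer, threshold, direction in product(layers, thresholds, ("larger", "smaller")):
        sims = positive_sims[layer] + negative_sims[layer]
        labels = [True] * len(positive_sims[layer]) + [False] * len(negative_sims[layer])
        preds = [should_steer(s, threshold, direction) for s in sims]
        score = f1_score(preds, labels)
        if best is None or score > best[0]:
            best = (score, layer, threshold, direction)
    return best


# Toy similarities (made up for illustration): layer 2 separates the two groups.
positive_sims = {1: [0.10, 0.12, 0.11], 2: [0.30, 0.35, 0.32]}
negative_sims = {1: [0.09, 0.13, 0.10], 2: [0.05, 0.08, 0.02]}
best = grid_search_condition(positive_sims, negative_sims, [0.0, 0.1, 0.2, 0.3])
print(best)
```

On this toy data the search selects layer 2 with threshold 0.1 and direction "smaller", which reads as: steer when 0.1 is smaller than the cosine similarity at layer 2. The released toolkit's `find_best_condition_point` also takes `max_layers_to_combine` and `threshold_step` arguments, which this sketch does not model.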