Activation Space Interventions Can Be Transferred Between Large Language Models
Authors: Narmeen Fatimah Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amir Abdullah
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across Llama, Qwen, and Gemma model families show that our method enables using smaller models to efficiently align larger ones. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the model's outputs in a predictable way. |
| Researcher Affiliation | Collaboration | 1Martian Learning; 2Nvidia; 3Singapore University of Technology and Design; 4Independent; 5Senior author, Martian Learning; 6Thoughtworks. Correspondence to: Narmeen Fatimah Oozeer <EMAIL>, Amir Abdullah <amir.abdullah@thoughtworks.com>. |
| Pseudocode | No | The paper describes methods like Prompt Steering and Difference in Means (Appendix A) using mathematical formulas, but does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at a GitHub repository. |
| Open Datasets | Yes | We include samples from the hh-rlhf dataset (Bai et al., 2022) to retain general capabilities of the model. We use the harmful and harmless dataset from Arditi et al. (2024) to find steerable layers in individual models. For autoencoder training, we augment the data with WildGuardMix (Han et al., 2024), as we find their dataset very small for training purposes. Using the settings from Arditi et al. (2024), we use 100 randomly sampled instructions from JailbreakBench (Chao et al., 2024) to evaluate jailbreak success, and 4096 random samples each from The Pile (Gao et al., 2020) and Alpaca (Taori et al., 2023) datasets to evaluate completion coherence. Our datasets and models are available in a Hugging Face collection. We take a sample of the MEDIQA dataset (Abacha et al., 2019) and create 178 harder question-answer snippets, curated via GPT-4 using the prompt in Box J. |
| Dataset Splits | Yes | The autoencoder is trained on a training split of the task dataset and evaluated using steering vector transfer rates and autoencoder validation scores on a test split. Validation Split 0.1 (10% of data). Using the settings from Arditi et al. (2024), we use 100 randomly sampled instructions from JailbreakBench (Chao et al., 2024) to evaluate jailbreak success, and 4096 random samples each from The Pile (Gao et al., 2020) and Alpaca (Taori et al., 2023) datasets to evaluate completion coherence. We take 2000 samples from each dataset, and train and evaluate a logistic regression probe on an 80-20 data split over 1000 iterations. |
| Hardware Specification | No | This choice helps keep the computation requirements low and at the same time helps establish the validity of our findings. |
| Software Dependencies | No | We utilized the transformers library alongside the trl extension for instruction fine-tuning, adhering to the standard Alpaca template. The fine-tuning process employed an AdamW optimizer with a weight decay of 0.05. For models leveraging flash-attention v2, such as Llama and Qwen variants, we enabled this feature to enhance attention computation efficiency and reduce memory overhead. We use the standard scikit-learn implementation and default parameters for a logistic regression classifier (Pedregosa et al., 2011). |
| Experiment Setup | Yes | For intervention, we apply methods such as Activation Addition (ActAdd) and Activation Ablation (Arditi et al., 2024). A detailed explanation of these techniques is provided in Appendix A. We set the steering magnitude α = 5 and average steering success over 50 randomly sampled prompts. The fine-tuning process employed an AdamW optimizer with a weight decay of 0.05. A learning rate of 2e-5 was selected based on preliminary experiments that demonstrated more stable convergence compared to higher rates, which resulted in larger validation oscillations. We set the maximum sequence length to 1024 tokens to accommodate lengthy conversations and code snippets. The effective batch size was maintained at 32 per update step. (From Appendix D.3, Table 22) Max Seq. Length 720 tokens, Batch Size 16 samples, Learning Rate 1e-4, Optimizer AdamW, Number of Epochs 3, Validation Split 0.1 (10% of data), Gradient Accumulation Steps 4. We train the SAE on layer 13 of Qwen-0.5 using the I HATE YOU dataset. We use a batch size of 16, a max sequence length of 512, and run training for 8 epochs. We choose top-K as our activation function, using k=128, and use an expansion factor of 16 on the original model dimension of 896. |
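The Activation Addition intervention described in the setup row can be sketched as a forward hook that adds a scaled steering direction to the residual stream. The sketch below is a minimal illustration, not the authors' implementation: the hook signature, the unit-normalization step, and the toy zero activations are assumptions; only the steering magnitude α = 5 and the hidden dimension 896 come from the table.

```python
import torch

def act_add_hook(steering_vector: torch.Tensor, alpha: float = 5.0):
    """Return a forward hook that adds alpha * v to the residual stream.

    Hypothetical sketch of ActAdd: a steering direction (e.g. a
    difference-in-means vector) is scaled by alpha and added at every
    token position of the hooked layer's output.
    """
    v = steering_vector / steering_vector.norm()  # unit-norm direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Toy demonstration on an all-zero activation tensor: after steering,
# every position has norm exactly alpha (here 5), since v is unit-norm.
d_model = 896
acts = torch.zeros(1, 4, d_model)  # (batch, seq, d_model)
direction = torch.randn(d_model)
steered = act_add_hook(direction, alpha=5.0)(None, None, acts)
print(steered.norm(dim=-1))
```

In practice such a hook would be registered on a chosen decoder layer via `register_forward_hook` before generation, and removed afterwards.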
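The top-K sparse autoencoder hyperparameters reported above (k = 128, expansion factor 16, model dimension 896) can be illustrated with a minimal module. This is a sketch under assumed architecture details: the linear encoder/decoder layout and the lack of bias tying or normalization are my simplifications, not the paper's design.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal top-K sparse autoencoder sketch (assumed architecture).

    Reported hyperparameters: d_model=896, expansion factor 16
    (dictionary size 14336), k=128 active latents per token.
    """
    def __init__(self, d_model: int = 896, expansion: int = 16, k: int = 128):
        super().__init__()
        d_hidden = d_model * expansion
        self.k = k
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)
        # Top-K activation: keep the k largest latents, zero the rest.
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse)

sae = TopKSAE()
x = torch.randn(2, 896)          # two token activations from the model
recon = sae(x)
print(recon.shape)               # torch.Size([2, 896])
```

Training would minimize reconstruction loss (e.g. MSE between `x` and `recon`) over activations collected at layer 13 of the base model, per the settings in the table.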
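The logistic regression probe from the Dataset Splits row (scikit-learn defaults, 80-20 split, 1000 iterations) is straightforward to reproduce in outline. The synthetic Gaussian activations below are placeholders for the real model activations, and mapping "1000 iterations" to `max_iter=1000` is my assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder stand-in for model activations: 2000 samples per class,
# as in the paper; the feature dimension and class means are invented.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.5, 1.0, (2000, 64)),
               rng.normal(-0.5, 1.0, (2000, 64))])
y = np.array([1] * 2000 + [0] * 2000)

# 80-20 split; scikit-learn defaults except the iteration budget.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(probe.score(X_te, y_te), 2))
```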