Activation Space Interventions Can Be Transferred Between Large Language Models
Authors: Narmeen Fatimah Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amir Abdullah
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across Llama, Qwen, and Gemma model families show that our method enables using smaller models to efficiently align larger ones. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the model's outputs in a predictable way. |
| Researcher Affiliation | Collaboration | 1Martian Learning; 2Nvidia; 3Singapore University of Technology and Design; 4Independent; 5Senior author, Martian Learning; 6Thoughtworks. Correspondence to: Narmeen Fatimah Oozeer <EMAIL>, Amir Abdullah <amir.abdullah@thoughtworks.com>. |
| Pseudocode | No | The paper describes methods like Prompt Steering and Difference in Means (Appendix A) using mathematical formulas, but does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at a GitHub repository. |
| Open Datasets | Yes | We include samples from the hh-rlhf dataset (Bai et al., 2022) to retain general capabilities of the model. We use the harmful and harmless dataset from Arditi et al. (2024) to find steerable layers in individual models. For autoencoder training, we augment the data with WildGuardMix (Han et al., 2024), as we find their dataset very small for training purposes. Using the settings from Arditi et al. (2024), we use 100 randomly sampled instructions from JailbreakBench (Chao et al., 2024) to evaluate jailbreak success, and 4096 random samples each from The Pile (Gao et al., 2020) and Alpaca (Taori et al., 2023) datasets to evaluate completion coherence. Our datasets and models are available in a Hugging Face collection. We take a sample of the MEDIQA dataset (Abacha et al., 2019) and create 178 harder question-answer snippets, curated via GPT-4 using the prompt in Box J. |
| Dataset Splits | Yes | The autoencoder is trained on a training split of the task dataset and evaluated using steering vector transfer rates and autoencoder validation scores on a test split. Validation Split 0.1 (10% of data). Using the settings from Arditi et al. (2024), we use 100 randomly sampled instructions from JailbreakBench (Chao et al., 2024) to evaluate jailbreak success, and 4096 random samples each from The Pile (Gao et al., 2020) and Alpaca (Taori et al., 2023) datasets to evaluate completion coherence. We take 2000 samples from each dataset, and train and evaluate a logistic regression probe on an 80-20 data split over 1000 iterations. |
| Hardware Specification | No | This choice helps keep the computation requirements low and at the same time helps establish the validity of our findings. |
| Software Dependencies | No | We utilized the transformers library alongside the trl extension for instruction fine-tuning, adhering to the standard Alpaca template. The fine-tuning process employed an AdamW optimizer with a weight decay of 0.05. For models leveraging flash-attention v2, such as Llama and Qwen variants, we enabled this feature to enhance attention computation efficiency and reduce memory overhead. We use the standard scikit-learn implementation and default parameters for a logistic regression classifier (Pedregosa et al., 2011). |
| Experiment Setup | Yes | For intervention, we apply methods such as Activation Addition (ActAdd) and Activation Ablation (Arditi et al., 2024). A detailed explanation of these techniques is provided in Appendix A. We set the steering magnitude α = 5 and average steering success over 50 randomly sampled prompts. The fine-tuning process employed an AdamW optimizer with a weight decay of 0.05. A learning rate of 2e-5 was selected based on preliminary experiments that demonstrated more stable convergence compared to higher rates, which resulted in larger validation oscillations. We set the maximum sequence length to 1024 tokens to accommodate lengthy conversations and code snippets. The effective batch size was maintained at 32 per update step. (From Appendix D.3, Table 22) Max Seq. Length 720 tokens, Batch Size 16 samples, Learning Rate 1e-4, Optimizer AdamW, Number of Epochs 3, Validation Split 0.1 (10% of data), Gradient Accumulation Steps 4. We train the SAE on layer 13 of Qwen-0.5 using the I HATE YOU dataset. We use a batch size of 16, a max sequence length of 512, and run training for 8 epochs. We choose top-K as our activation function, using k=128, and use an expansion factor of 16 on the original model dimension of 896. |
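The Activation Addition intervention described in the setup row can be sketched as a forward hook that adds a scaled steering direction to the residual stream. The sketch below is a minimal illustration, not the authors' implementation: the hook signature, the unit-normalization step, and the toy zero activations are assumptions; only the steering magnitude α = 5 and the hidden dimension 896 come from the table.

```python
import torch

def act_add_hook(steering_vector: torch.Tensor, alpha: float = 5.0):
    """Return a forward hook that adds alpha * v to the residual stream.

    Hypothetical sketch of ActAdd: a steering direction (e.g. a
    difference-in-means vector) is scaled by alpha and added at every
    token position of the hooked layer's output.
    """
    v = steering_vector / steering_vector.norm()  # unit-norm direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Toy demonstration on an all-zero activation tensor: after steering,
# every position has norm exactly alpha (here 5), since v is unit-norm.
d_model = 896
acts = torch.zeros(1, 4, d_model)  # (batch, seq, d_model)
direction = torch.randn(d_model)
steered = act_add_hook(direction, alpha=5.0)(None, None, acts)
print(steered.norm(dim=-1))
```

In practice such a hook would be registered on a chosen decoder layer via `register_forward_hook` before generation, and removed afterwards.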
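The top-K sparse autoencoder hyperparameters reported above (k = 128, expansion factor 16, model dimension 896) can be illustrated with a minimal module. This is a sketch under assumed architecture details: the linear encoder/decoder layout and the lack of bias tying or normalization are my simplifications, not the paper's design.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal top-K sparse autoencoder sketch (assumed architecture).

    Reported hyperparameters: d_model=896, expansion factor 16
    (dictionary size 14336), k=128 active latents per token.
    """
    def __init__(self, d_model: int = 896, expansion: int = 16, k: int = 128):
        super().__init__()
        d_hidden = d_model * expansion
        self.k = k
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)
        # Top-K activation: keep the k largest latents, zero the rest.
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse)

sae = TopKSAE()
x = torch.randn(2, 896)          # two token activations from the model
recon = sae(x)
print(recon.shape)               # torch.Size([2, 896])
```

Training would minimize reconstruction loss (e.g. MSE between `x` and `recon`) over activations collected at layer 13 of the base model, per the settings in the table.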
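The logistic regression probe from the Dataset Splits row (scikit-learn defaults, 80-20 split, 1000 iterations) is straightforward to reproduce in outline. The synthetic Gaussian activations below are placeholders for the real model activations, and mapping "1000 iterations" to `max_iter=1000` is my assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder stand-in for model activations: 2000 samples per class,
# as in the paper; the feature dimension and class means are invented.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.5, 1.0, (2000, 64)),
               rng.normal(-0.5, 1.0, (2000, 64))])
y = np.array([1] * 2000 + [0] * 2000)

# 80-20 split; scikit-learn defaults except the iteration budget.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(probe.score(X_te, y_te), 2))
```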