Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Authors: Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. |
| Researcher Affiliation | Collaboration | Zora Che, University of Maryland, ML Alignment & Theory Scholars; Stephen Casper, MIT CSAIL, ML Alignment & Theory Scholars; Robert Kirk, UK AI Security Institute; Anirudh Satheesh, University of Maryland; Stewart Slocum, MIT; Lev McKinney, University of Toronto; Rohit Gandikota, Northeastern University; Aidan Ewart, Haize Labs; Domenic Rosati, Dalhousie University; Zichu Wu, University of Waterloo; Zikui Cai, University of Maryland; Bilal Chughtai, Apollo Research; Yarin Gal, UK AI Security Institute, University of Oxford; Furong Huang, University of Maryland; Dylan Hadfield-Menell, MIT |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly provided in the paper. Methods are described in text and tables. |
| Open Source Code | Yes | We release models at https://huggingface.co/LLM-GAT. |
| Open Datasets | Yes | Attacks on unlearning (non-fine-tuning): we used 64 held-out examples of multiple-choice biology questions from the WMDP-Bio test set. Defenses (machine unlearning methods): We unlearn dual-use bio-hazardous knowledge on Llama-3-8B-Instruct (Dubey et al., 2024) with the unlearning methods listed in Table 1 and outlined in Appendix A.2.1. For all methods, we train on 1,600 examples of max length 512 from the bio-remove-split of the WMDP forget set (Li et al., 2024b), and up to 1,600 examples of max length 512 from Wikitext as the retain set. |
| Dataset Splits | Yes | For all methods, we train on 1,600 examples of max length 512 from the bio-remove-split of the WMDP forget set (Li et al., 2024b), and up to 1,600 examples of max length 512 from Wikitext as the retain set. Attacks on unlearning (non-fine-tuning): we used 64 held-out examples of multiple-choice biology questions from the WMDP-Bio test set. For details on attack configurations, including the number of examples, batch size, number of steps, and other hyper-parameters, see Appendix A.4. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) are explicitly provided for running the experiments in the main text or appendices. The paper mentions LLMs generally, and one citation mentions 'one gpu minute' but this refers to a third-party tool. |
| Software Dependencies | No | The paper mentions several models and unlearning methods (e.g., 'Llama-3-8B-Instruct', 'Grad Diff', 'RMU'), but does not provide specific version numbers for software libraries or programming languages used in the implementation. For example, it cites PyTorch but does not specify a version number. |
| Experiment Setup | Yes | Appendix A.2.2, titled 'Hyperparameters', details specific settings for various unlearning methods, including 'LoRA Rank: 256', 'Learning Rate: 10^-4', 'Batch Size: 32', 'Unlearning Loss Coefficient β: 14'. Additionally, Tables 4 and 5 provide hyperparameters for fine-tuning attacks, such as '# of Examples', 'Batch Size', 'Learning Rate', 'Epochs', and 'Total Steps'. |
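The paper's finding that unlearning "can easily be undone within 16 steps of fine-tuning" can be illustrated with a toy sketch. This is not the paper's code: the one-parameter logistic model, the data, and the learning rate below are invented for illustration. The "unlearned" weight has the wrong sign, so the model misclassifies every example; 16 plain gradient steps on a small held-out set restore it, mirroring the structure of the fine-tuning attack (a small attack budget recovering a suppressed capability).

```python
import math

# Toy illustration of a fine-tuning attack on an "unlearned" model
# (assumed setup, not the paper's actual experiment).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w, data):
    # Fraction of examples where the thresholded prediction matches the label.
    return sum(1 for x, y in data if (sigmoid(w * x) >= 0.5) == (y == 1)) / len(data)

# Held-out "attack" set: label is 1 iff x > 0, so any positive weight is correct.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

w = -3.0        # "unlearned" weight: wrong sign, 0% accuracy before the attack
lr = 1.0
for step in range(16):  # 16 fine-tuning steps, matching the paper's attack budget
    # Mean gradient of the logistic (cross-entropy) loss w.r.t. w.
    grad = sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(accuracy(w, data))  # → 1.0: the attack fully restores the capability
```

The point of the sketch is only the shape of the result: an attack far cheaper than the original unlearning run (16 gradient steps on 6 examples) suffices to recover the behavior, which is why the paper uses tampering attacks as conservative capability estimates.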