Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Authors: Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. |
| Researcher Affiliation | Collaboration | Zora Che, University of Maryland, ML Alignment & Theory Scholars; Stephen Casper, MIT CSAIL, ML Alignment & Theory Scholars; Robert Kirk, UK AI Security Institute; Anirudh Satheesh, University of Maryland; Stewart Slocum, MIT; Lev McKinney, University of Toronto; Rohit Gandikota, Northeastern University; Aidan Ewart, Haize Labs; Domenic Rosati, Dalhousie University; Zichu Wu, University of Waterloo; Zikui Cai, University of Maryland; Bilal Chughtai, Apollo Research; Yarin Gal, UK AI Security Institute, University of Oxford; Furong Huang, University of Maryland; Dylan Hadfield-Menell, MIT |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly provided in the paper. Methods are described in text and tables. |
| Open Source Code | Yes | We release models at https://huggingface.co/LLM-GAT. |
| Open Datasets | Yes | Attacks on unlearning (non-fine-tuning): we used 64 held-out examples of multiple-choice biology questions from the WMDP-Bio test set. Defenses (machine unlearning methods): We unlearn dual-use bio-hazardous knowledge on Llama-3-8B-Instruct (Dubey et al., 2024) with the unlearning methods listed in Table 1 and outlined in Appendix A.2.1. For all methods, we train on 1,600 examples of max length 512 from the bio-remove-split of the WMDP forget set (Li et al., 2024b), and up to 1,600 examples of max length 512 from Wikitext as the retain set. |
| Dataset Splits | Yes | For all methods, we train on 1,600 examples of max length 512 from the bio-remove-split of the WMDP forget set (Li et al., 2024b), and up to 1,600 examples of max length 512 from Wikitext as the retain set. Attacks on unlearning (non-fine-tuning): we used 64 held-out examples of multiple-choice biology questions from the WMDP-Bio test set. For details on attack configurations, including the number of examples, batch size, number of steps, and other hyper-parameters, see Appendix A.4. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) are explicitly provided for running the experiments in the main text or appendices. The paper mentions LLMs generally, and one citation mentions 'one gpu minute' but this refers to a third-party tool. |
| Software Dependencies | No | The paper mentions several models and unlearning methods (e.g., 'Llama-3-8B-Instruct', 'Grad Diff', 'RMU'), but does not provide specific version numbers for software libraries or programming languages used in the implementation. For example, it cites PyTorch but does not specify a version number. |
| Experiment Setup | Yes | Appendix A.2.2, titled 'Hyperparameters', details specific settings for various unlearning methods, including 'LoRA Rank: 256', 'Learning Rate: 10^-4', 'Batch Size: 32', 'Unlearning Loss Coefficient β: 14'. Additionally, Tables 4 and 5 provide hyperparameters for fine-tuning attacks, such as '# of Examples', 'Batch Size', 'Learning Rate', 'Epochs', and 'Total Steps'. |
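The paper's finding that unlearning "can easily be undone within 16 steps of fine-tuning" can be illustrated with a toy sketch. This is not the paper's code: the one-parameter logistic model, the data, and the learning rate below are invented for illustration. The "unlearned" weight has the wrong sign, so the model misclassifies every example; 16 plain gradient steps on a small held-out set restore it, mirroring the structure of the fine-tuning attack (a small attack budget recovering a suppressed capability).

```python
import math

# Toy illustration of a fine-tuning attack on an "unlearned" model
# (assumed setup, not the paper's actual experiment).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w, data):
    # Fraction of examples where the thresholded prediction matches the label.
    return sum(1 for x, y in data if (sigmoid(w * x) >= 0.5) == (y == 1)) / len(data)

# Held-out "attack" set: label is 1 iff x > 0, so any positive weight is correct.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

w = -3.0        # "unlearned" weight: wrong sign, 0% accuracy before the attack
lr = 1.0
for step in range(16):  # 16 fine-tuning steps, matching the paper's attack budget
    # Mean gradient of the logistic (cross-entropy) loss w.r.t. w.
    grad = sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(accuracy(w, data))  # → 1.0: the attack fully restores the capability
```

The point of the sketch is only the shape of the result: an attack far cheaper than the original unlearning run (16 gradient steps on 6 examples) suffices to recover the behavior, which is why the paper uses tampering attacks as conservative capability estimates.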