An Adversarial Perspective on Machine Unlearning for AI Safety

Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We perform the first comprehensive white-box evaluation of state-of-the-art unlearning methods for hazardous knowledge, comparing them to traditional safety training with DPO (Rafailov et al., 2024). Our results show that while unlearning is robust against specific attacks like probing internal model activations, it can also be easily compromised with methods similar to those used against safety training. We use the accuracy on the WMDP benchmark (Li et al., 2024) to measure the hazardous knowledge contained in LLMs."
Researcher Affiliation | Academia | Jakub Łucki (ETH Zurich), Boyi Wei (Princeton University), Yangsibo Huang (Princeton University), Peter Henderson (Princeton University), Florian Tramèr (ETH Zurich), Javier Rando (ETH Zurich)
Pseudocode | Yes | Algorithm 1: Informed Perturbation Algorithm; Algorithm 2: Insert Perturbation Algorithm
Open Source Code | Yes | "Code is available at: https://github.com/ethz-spylab/unlearning-vs-safety"
Open Datasets | Yes | "We use the accuracy on the WMDP benchmark (Li et al., 2024) to measure the hazardous knowledge contained in LLMs. This model results from finetuning Zephyr-7B-β (Tunstall et al., 2023) on the WMDP and WikiText corpora (Merity et al., 2016). For NPO, we use the preference dataset on hazardous knowledge as negative samples and the retain preference dataset mixed with the Open Assistant dataset (50:50) for the auxiliary retain loss."
Dataset Splits | Yes | "We balance the training data by including samples from the forget and retain preference datasets, as well as Open Assistant (Köpf et al., 2024), in a 50:25:25 ratio." Dataset proportions: 50:25:25
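The 50:25:25 mix of forget, retain, and Open Assistant samples can be sketched as a simple sampling routine. The paper only states the ratio, not the sampling procedure, so the helper below (`mix_datasets`) and its exact rounding behavior are assumptions for illustration.

```python
import random

def mix_datasets(forget, retain, assistant, total, seed=0):
    """Draw a 50:25:25 forget/retain/Open-Assistant mixture.

    Hypothetical helper: the paper reports only the 50:25:25 ratio;
    sampling without replacement and the seed are assumptions.
    """
    rng = random.Random(seed)
    n_forget = total // 2          # 50% forget preference data
    n_retain = total // 4          # 25% retain preference data
    n_assist = total - n_forget - n_retain  # remaining 25% Open Assistant
    mixture = (rng.sample(forget, n_forget)
               + rng.sample(retain, n_retain)
               + rng.sample(assistant, n_assist))
    rng.shuffle(mixture)           # interleave sources for training
    return mixture
```

With `total=40`, this yields 20 forget, 10 retain, and 10 Open Assistant samples.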
Hardware Specification | No | The paper does not provide specific hardware details, such as the GPU/CPU models or processor types used to run its experiments.
Software Dependencies | No | The paper mentions the 'adamw_torch' optimizer but does not provide version numbers for the software libraries or dependencies used in the implementation.
Experiment Setup | Yes | "The best hyperparameters are the following:" (Table 2: Best found hyperparameters for DPO and NPO for each knowledge domain): learning rate 1e-6; β 0.1; dataset proportions 50:25:25; α 0.5; epochs 2; max length 1024; gradient accumulation steps 1; per-device batch size 4; warmup steps 150; quantization bf16.
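The Table 2 hyperparameters can be collected into a single training configuration. This is a minimal sketch: the values come from the paper, but the field names (chosen to follow Hugging Face `TrainingArguments` conventions) and the mapping of β/α to DPO/NPO loss terms are assumptions.

```python
# Hedged sketch of the best hyperparameters reported in Table 2.
# Only the values are from the paper; the key names are assumed to
# mirror Hugging Face TrainingArguments-style fields.
best_hparams = {
    "learning_rate": 1e-6,
    "beta": 0.1,                           # DPO/NPO temperature (assumed role)
    "dataset_proportions": (50, 25, 25),   # forget : retain : Open Assistant
    "alpha": 0.5,                          # auxiliary retain-loss weight (assumed role)
    "num_train_epochs": 2,
    "max_length": 1024,
    "gradient_accumulation_steps": 1,
    "per_device_train_batch_size": 4,
    "warmup_steps": 150,
    "bf16": True,                          # "Quantization bf16" in Table 2
    "optim": "adamw_torch",                # the one named software dependency
}
```

Such a dict can be unpacked into a trainer configuration or logged alongside results for reproducibility.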