An Adversarial Perspective on Machine Unlearning for AI Safety
Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform the first comprehensive white-box evaluation of state-of-the-art unlearning methods for hazardous knowledge, comparing them to traditional safety training with DPO (Rafailov et al., 2024). Our results show that while unlearning is robust against specific attacks like probing internal model activations, it can also be easily compromised with methods similar to those used against safety training. We use the accuracy on the WMDP benchmark (Li et al., 2024) to measure the hazardous knowledge contained in LLMs. |
| Researcher Affiliation | Academia | Jakub Łucki¹, Boyi Wei², Yangsibo Huang², Peter Henderson², Florian Tramèr¹, Javier Rando¹ (¹ETH Zurich, ²Princeton University) |
| Pseudocode | Yes | Algorithm 1: Informed Perturbation Algorithm; Algorithm 2: Insert Perturbation Algorithm |
| Open Source Code | Yes | Code is available at: https://github.com/ethz-spylab/unlearning-vs-safety |
| Open Datasets | Yes | We use the accuracy on the WMDP benchmark (Li et al., 2024) to measure the hazardous knowledge contained in LLMs. This model results from finetuning Zephyr-7B-β (Tunstall et al., 2023) on the WMDP and WikiText corpora (Merity et al., 2016). For NPO, we use the preference dataset on hazardous knowledge as negative samples and, for the auxiliary retain loss, the retain preference dataset mixed 50:50 with the Open Assistant dataset. |
| Dataset Splits | Yes | We balance the training data by including samples from the forget and retain preference datasets, as well as Open Assistant (Köpf et al., 2024), in a 50:25:25 ratio. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or processor types used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Optimizer adamw_torch' but does not provide specific version numbers for software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | Table 2 reports the best found hyperparameters for DPO and NPO for each knowledge domain: learning rate 1e-6; β 0.1; dataset proportions 50:25:25; α 0.5; epochs 2; max length 1024; gradient accumulation steps 1; per-device batch size 4; warmup steps 150; quantization bf16. |
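The safety-training baseline in the evaluation is DPO (Rafailov et al., 2024) with β = 0.1 from Table 2. As a minimal sketch of the objective (our own illustrative helper, not the authors' released code), the per-example DPO loss on chosen/rejected response log-probabilities is:

```python
import math

def dpo_loss(beta, policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Log-probabilities are summed over response tokens; beta=0.1 matches
    Table 2. Illustrative sketch only, not the paper's implementation.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Numerically, -log(sigmoid(margin)) == log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference model, the margin is zero and the loss is log 2; increasing the chosen response's log-probability relative to the reference drives the loss down.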
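The 50:25:25 balance of forget, retain, and Open Assistant training data can be sketched as follows; the helper and its sampling scheme are illustrative assumptions, not the authors' released pipeline:

```python
import random

def mix_datasets(forget, retain, open_assistant,
                 ratios=(0.50, 0.25, 0.25), total=1000, seed=0):
    """Draw a training mix in the paper's 50:25:25 forget/retain/Open Assistant ratio.

    Samples with replacement from each source, then shuffles the result.
    Illustrative sketch only; the authors' exact procedure may differ.
    """
    rng = random.Random(seed)
    counts = [round(total * r) for r in ratios]
    mix = (rng.choices(forget, k=counts[0])
           + rng.choices(retain, k=counts[1])
           + rng.choices(open_assistant, k=counts[2]))
    rng.shuffle(mix)
    return mix
```

With `total=1000`, the mix contains exactly 500 forget, 250 retain, and 250 Open Assistant examples.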