An Adversarial Perspective on Machine Unlearning for AI Safety

Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We perform the first comprehensive white-box evaluation of state-of-the-art unlearning methods for hazardous knowledge, comparing them to traditional safety training with DPO (Rafailov et al., 2024). Our results show that while unlearning is robust against specific attacks like probing internal model activations, it can also be easily compromised with methods similar to those used against safety training. We use the accuracy on the WMDP benchmark (Li et al., 2024) to measure the hazardous knowledge contained in LLMs."
Researcher Affiliation | Academia | Jakub Łucki (ETH Zurich), Boyi Wei (Princeton University), Yangsibo Huang (Princeton University), Peter Henderson (Princeton University), Florian Tramèr (ETH Zurich), Javier Rando (ETH Zurich)
Pseudocode | Yes | Algorithm 1: Informed Perturbation Algorithm; Algorithm 2: Insert Perturbation Algorithm
Open Source Code | Yes | "Code is available at: https://github.com/ethz-spylab/unlearning-vs-safety"
Open Datasets | Yes | "We use the accuracy on the WMDP benchmark (Li et al., 2024) to measure the hazardous knowledge contained in LLMs. This model results from finetuning Zephyr-7B-β (Tunstall et al., 2023) on the WMDP and WikiText corpora (Merity et al., 2016). For NPO, we use the preference dataset on hazardous knowledge as negative samples and the retain preference dataset mixed with the Open Assistant dataset (50:50) for the auxiliary retain loss."
Dataset Splits | Yes | "We balance the training data by including samples from the forget and retain preference datasets, as well as Open Assistant (Köpf et al., 2024), in a 50:25:25 ratio." Dataset proportions: 50:25:25
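The 50:25:25 mix of forget, retain, and Open Assistant samples can be sketched as a simple sampling routine. The paper only states the ratio, not the sampling procedure, so the helper below (`mix_datasets`) and its exact rounding behavior are assumptions for illustration.

```python
import random

def mix_datasets(forget, retain, assistant, total, seed=0):
    """Draw a 50:25:25 forget/retain/Open-Assistant mixture.

    Hypothetical helper: the paper reports only the 50:25:25 ratio;
    sampling without replacement and the seed are assumptions.
    """
    rng = random.Random(seed)
    n_forget = total // 2          # 50% forget preference data
    n_retain = total // 4          # 25% retain preference data
    n_assist = total - n_forget - n_retain  # remaining 25% Open Assistant
    mixture = (rng.sample(forget, n_forget)
               + rng.sample(retain, n_retain)
               + rng.sample(assistant, n_assist))
    rng.shuffle(mixture)           # interleave sources for training
    return mixture
```

With `total=40`, this yields 20 forget, 10 retain, and 10 Open Assistant samples.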
Hardware Specification | No | The paper does not provide specific hardware details, such as the GPU/CPU models or processor types used to run its experiments.
Software Dependencies | No | The paper mentions the 'adamw_torch' optimizer but does not provide version numbers for the software libraries or dependencies used in the implementation.
Experiment Setup | Yes | "The best hyperparameters are the following:" (Table 2: Best found hyperparameters for DPO and NPO for each knowledge domain): learning rate 1e-6; β 0.1; dataset proportions 50:25:25; α 0.5; epochs 2; max length 1024; gradient accumulation steps 1; per-device batch size 4; warmup steps 150; quantization bf16.
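The Table 2 hyperparameters can be collected into a single training configuration. This is a minimal sketch: the values come from the paper, but the field names (chosen to follow Hugging Face `TrainingArguments` conventions) and the mapping of β/α to DPO/NPO loss terms are assumptions.

```python
# Hedged sketch of the best hyperparameters reported in Table 2.
# Only the values are from the paper; the key names are assumed to
# mirror Hugging Face TrainingArguments-style fields.
best_hparams = {
    "learning_rate": 1e-6,
    "beta": 0.1,                           # DPO/NPO temperature (assumed role)
    "dataset_proportions": (50, 25, 25),   # forget : retain : Open Assistant
    "alpha": 0.5,                          # auxiliary retain-loss weight (assumed role)
    "num_train_epochs": 2,
    "max_length": 1024,
    "gradient_accumulation_steps": 1,
    "per_device_train_batch_size": 4,
    "warmup_steps": 150,
    "bf16": True,                          # "Quantization bf16" in Table 2
    "optim": "adamw_torch",                # the one named software dependency
}
```

Such a dict can be unpacked into a trainer configuration or logged alongside results for reproducibility.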