On Evaluating the Durability of Safeguards for Open-Weight LLMs
Authors: Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we examine several common pitfalls to demonstrate how the durability evaluation of LLM safeguards can easily go wrong. Specifically, we focus on empirical case studies of two recently proposed safeguards for open-weight LLMs (Tamirisa et al., 2024; Rosati et al., 2024). Through these case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the evaluation pitfalls that we identify and suggest that future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide more useful and candid assessments to stakeholders. |
| Researcher Affiliation | Collaboration | Xiangyu Qi¹, Boyi Wei¹, Nicholas Carlini², Yangsibo Huang¹·², Tinghao Xie¹, Luxi He¹, Matthew Jagielski², Milad Nasr², Prateek Mittal¹, Peter Henderson¹. ¹Princeton University; ²Google. Equal contribution. Correspondence to: Xiangyu Qi (EMAIL), Boyi Wei (EMAIL), Prateek Mittal (EMAIL), Peter Henderson (EMAIL). |
| Pseudocode | No | The paper describes the methods using mathematical formulations and textual descriptions in Appendix D (e.g., the RepNoise objective L_RepNoise = E_{x∼D_retain}[L(x, θ)] − α·E_{x∼D_forget}[L(x, θ)] + β·L_noise), but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Code available at: https://github.com/AI-Law-Society-Lab/Evaluating-Durable-Safeguards. |
| Open Datasets | Yes | BeaverTails (Ji et al., 2024) is the benchmark used by Rosati et al. (2024) to evaluate RepNoise; we adopt the same evaluation setup... We also consider two additional harmfulness evaluation datasets: HEx-PHI (Qi et al., 2023) and SORRY-Bench (Xie et al., 2024), two dedicated benchmarks for evaluating harmfulness in the Harmful QA context... All of our evaluations of TAR use its public checkpoints, with bio-weaponization knowledge restrictions. Specifically, by default, we report results on the latest v2 version of the checkpoint, Llama-3-8B-Instruct-TAR-Bio-v2 (we term it TAR-Bio-v2), per the request of the original authors. For the threat model of weaponization knowledge restriction, LLMs are evaluated on the Weapons of Mass Destruction Proxy (WMDP) benchmark (Li et al., 2024)... Two datasets are involved, corresponding to the forget set (which the model is to unlearn) and the retain set (which the model should not unlearn) used by the original TAR training in the biosecurity domain. Pile-Bio Forget is the forget set; Pile-Bio Retain is the retain set. See Appendix E.2 for details. Lapis Labs. pile-bio dataset, 2024. URL: https://huggingface.co/lapisrocks. |
| Dataset Splits | Yes | The only difference is that we enable random shuffling of the fine-tuning dataset to introduce randomness into the fine-tuning attack... We use the same checkpoint released by the authors, the same hyperparameters, the same codebase, and the same datasets (for both fine-tuning attacks and harmfulness evaluations)... We use the same subset of BeaverTails-30k-train from the official codebase, with dataset sizes of 1,000 and 10,000 examples... Dataset Partition: Rosati et al. (2024) use a filtered subset of BeaverTails-30k-train as the dataset for training and attacking RepNoise, and a filtered subset of BeaverTails-30k-test as the test set for harmfulness evaluation. The train/attack set overlaps heavily with the test set: 75.3% of the test-set elements also appear in the training and attack sets. |
| Hardware Specification | Yes | For all experiments conducted in the official RepNoise codebase, we use 1 NVIDIA H100 80GB GPU with gradient_accumulation_steps=1. We use 4 NVIDIA H100 80GB GPUs to run the experiments with gradient_accumulation_steps=1. For all experiments using TAR's official codebase, we fine-tune the model for 1000 steps on 4 NVIDIA H100 80GB GPUs with gradient_accumulation_steps=2. For all experiments in our codebase, we use 4 NVIDIA H100 80GB GPUs with gradient_accumulation_steps=2. |
| Software Dependencies | No | The paper mentions several software components like “Huggingface SFT Trainer”, “Huggingface TRL library”, “PyTorch profiling tool”, “transformers.TrainingArguments.lr_scheduler_type”, but does not specify their version numbers. |
| Experiment Setup | Yes | Table 1: Fine-tuning configurations that we evaluate for TAR. This includes 4 configurations (Orig-Config) from the original study of Tamirisa et al. (2024) and 2 new configurations (New-Config) that we design. Two datasets are involved, corresponding to the forget set (which the model is to unlearn) and the retain set (which the model should not unlearn) used by the original TAR training in the biosecurity domain. Pile-Bio Forget is the forget set; Pile-Bio Retain is the retain set. See Appendix E.2 for details. Table 2: Hyperparameter configurations used in our exact implementation of RepNoise. For the fine-tuning dataset, we use the same subset of BeaverTails-30k-train from the official codebase. Table 2 columns: FT Dataset, LR, # Examples, Optimizer, LR Scheduler, Warmup Ratio. |
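The RepNoise objective quoted in the Pseudocode row combines a retain-set loss, a forget-set loss, and a noise regularizer. A minimal sketch of that combination is below; it assumes the α-weighted forget term is subtracted (gradient ascent on harmful data) while the β-weighted noise term is added, and the weights shown are illustrative, not the values used in the paper.

```python
def repnoise_style_loss(retain_loss, forget_loss, noise_loss,
                        alpha=1.0, beta=4.0):
    """Combine the three terms of a RepNoise-style objective.

    retain_loss: loss on the retain set (minimized as usual).
    forget_loss: loss on the forget set (subtracted, i.e. ascended).
    noise_loss:  regularizer pushing harmful representations toward noise.
    alpha, beta: illustrative weights (assumptions, not paper values).
    """
    return retain_loss - alpha * forget_loss + beta * noise_loss


# Toy scalar losses standing in for batch-averaged model losses.
total = repnoise_style_loss(retain_loss=2.0, forget_loss=3.0,
                            noise_loss=0.5, alpha=1.0, beta=4.0)
print(total)  # 2.0 - 1.0*3.0 + 4.0*0.5 = 1.0
```

In a real training loop, each of the three scalars would come from a forward pass over the corresponding batch, and `total.backward()` would drive the parameter update.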
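The Dataset Splits row flags a 75.3% overlap between the RepNoise train/attack set and its test set. That kind of leakage check reduces to a set-membership count; a self-contained sketch follows, with toy strings standing in for BeaverTails examples (the prompts and the 0.75 result here are illustrative, not the paper's data).

```python
def test_set_overlap(train_set, test_set):
    """Fraction of test examples that also appear in the train/attack set."""
    train = set(train_set)
    hits = sum(1 for x in test_set if x in train)
    return hits / len(test_set)


# Hypothetical examples; real usage would load the filtered
# BeaverTails-30k-train and BeaverTails-30k-test subsets.
train = ["prompt a", "prompt b", "prompt c"]
test = ["prompt a", "prompt b", "prompt c", "unseen prompt d"]
print(test_set_overlap(train, test))  # 3 of 4 test items leak -> 0.75
```

A high value here means the harmfulness evaluation is partly measuring memorization of the attack data rather than generalization, which is exactly the pitfall the report highlights.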