Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Authors: Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. |
| Researcher Affiliation | Academia | 1University of Virginia 2University of Georgia. Correspondence to: Anil Vullikanti <EMAIL>, Sheng Li <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical equations and text, but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter/. |
| Open Datasets | Yes | Specifically, we prompt the model with harmful queries from HEx-PHI (Qi et al., 2023), which consists of 330 samples across 11 categories. ... We use Dolly (Conover et al., 2023) and Alpaca (Taori et al., 2023) as the benign datasets for further fine-tuning. ... (2) 2000 random samples from the Asclepius dataset (Kweon et al., 2024), which is a popular clinical QA dataset. ... Bianchi (Bianchi et al., 2023). The Bianchi dataset contains 2483 safety samples created by (Bianchi et al., 2023). ... BeaverTails (Ji et al., 2024). The BeaverTails dataset is a conceptual dataset designed to study and evaluate the risks of fine-tuning large language models (LLMs) with adversarial or harmful examples. |
| Dataset Splits | Yes | Our proposed method is to filter the top-k (k=100) samples D_s with the highest Self-Inf scores as follows, ... To simulate the attack, we evaluate two types of continuous fine-tuning datasets: (1) in-distribution task-specific samples drawn from the same distribution as the attacker's selected samples, and (2) out-of-distribution task-specific samples whose distributions are different from those used by the attacker. ... we continue to fine-tune the model with (1) 2000 random samples from the Dolly dataset and (2) 2000 random samples from the Asclepius dataset (Kweon et al., 2024) ... we define a poisoning ratio α as α = # mixed benign samples / total # samples. ... the final fine-tuning dataset is composed of: (1) Nα samples selected by the Self-Inf method and (2) N(1−α) benign samples randomly selected from the original dataset. |
| Hardware Specification | Yes | All experiments are conducted on a server equipped with 2 A100 (80GB) GPUs. |
| Software Dependencies | No | The paper mentions using models like Llama2-7B-Chat, Qwen-2-7B-Instruct, Gemma-2-9B-IT, Ministral-8B-Instruct-2410, Llama-3-8B-Chat, and GPT-4, as well as tools like the Perspective API and OpenAI Moderation API, and references LLaMA-Factory for hyperparameter recommendations. However, it does not specify version numbers for any key software components or libraries. |
| Experiment Setup | Yes | For the preliminary experiment, we fine-tune the LLMs over the 100 samples filtered from the benign dataset, with a learning rate of 5e-5, a batch size of 20, and 5 fine-tuning epochs. ... we use a learning rate of 2e-5 for Ministral-8B-Instruct-2410, Qwen2-7B-Instruct, and Llama-3-8B-Chat following recommendations in LLaMA-Factory. Unless otherwise specified, the number of fine-tuning samples is set to 100, and the number of epochs is set to 5. For models smaller than 7B, we use a batch size of 20 per device, while for models larger than 7B, the batch size is reduced to 10 per device due to GPU memory limitations. ... We implement LoRA with a learning rate of 2e-3, a batch size of 10, and 10 fine-tuning epochs following (Qi et al., 2023). |
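The dataset-construction procedure quoted above (top-k selection by Self-Inf score, then mixing at poisoning ratio α) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the assumption that Self-Inf scores are precomputed floats are hypothetical.

```python
import random

def select_top_k(samples, scores, k=100):
    """Filter the top-k samples with the highest (hypothetical,
    precomputed) Self-Inf scores, as in the paper's selection step."""
    ranked = sorted(zip(samples, scores), key=lambda pair: pair[1], reverse=True)
    return [sample for sample, _ in ranked[:k]]

def mix_finetune_set(selected, benign_pool, total_n, alpha, seed=0):
    """Compose the final fine-tuning set of N = total_n samples:
    N*alpha Self-Inf-selected samples plus N*(1-alpha) random benign
    samples, matching the poisoning ratio alpha defined in the paper."""
    n_selected = round(total_n * alpha)
    n_benign = total_n - n_selected
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return selected[:n_selected] + rng.sample(benign_pool, n_benign)
```

For example, `mix_finetune_set(select_top_k(pool, scores, k=100), pool, total_n=100, alpha=0.5)` would yield a 100-sample set that is half outlier samples and half random benign samples.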