Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety

Authors: Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards.
Researcher Affiliation | Academia | 1 University of Virginia; 2 University of Georgia. Correspondence to: Anil Vullikanti <EMAIL>, Sheng Li <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical equations and text, but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter/.
Open Datasets | Yes | Specifically, we prompt the model with harmful queries from HEx-PHI (Qi et al., 2023), which consists of 330 samples across 11 categories. ... We use Dolly (Conover et al., 2023) and Alpaca (Taori et al., 2023) as the benign datasets for further fine-tuning. ... (2) 2000 random samples from the Asclepius dataset (Kweon et al., 2024), which is a popular clinical QA dataset. ... Bianchi (Bianchi et al., 2023). The Bianchi dataset contains 2483 safety samples created by (Bianchi et al., 2023). ... BeaverTails (Ji et al., 2024). The BeaverTails dataset is a conceptual dataset designed to study and evaluate the risks of fine-tuning large language models (LLMs) with adversarial or harmful examples.
Dataset Splits | Yes | Our proposed method is to filter the top-k (k = 100) samples D_s with the highest Self-Inf scores as follows, ... To simulate the attack, we evaluate two types of continuous fine-tuning datasets: (1) in-distribution task-specific samples drawn from the same distribution as the attacker's selected samples, and (2) out-of-distribution task-specific samples whose distributions are different from those used by the attacker. ... we continue to fine-tune the model with (1) 2000 random samples from the Dolly dataset and (2) 2000 random samples from the Asclepius dataset (Kweon et al., 2024) ... we define a poisoning ratio α as α = (# mixed benign samples) / (total # samples). ... the final fine-tuning dataset is composed of: (1) Nα samples selected by the Self-Inf-N method and (2) N(1 − α) benign samples randomly selected from the original dataset.
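The dataset-mixing step quoted above can be sketched as follows. This is a minimal illustration assuming the Self-Inf scoring has already been run (the function names and the pre-sorted `selected` list are assumptions; the paper's actual scoring and filtering code is in its repository):

```python
import random

def compose_finetuning_set(selected, benign_pool, N, alpha, seed=0):
    """Mix attacker-selected samples with random benign samples.

    Per the quoted definition, alpha = (# mixed benign samples) / (total
    # samples), so the final set of size N holds round(N * alpha) samples
    taken from `selected` (assumed sorted by Self-Inf score, highest
    first) and the remaining N * (1 - alpha) samples drawn uniformly at
    random from the untouched benign pool.
    """
    rng = random.Random(seed)
    n_selected = round(N * alpha)
    mixed = selected[:n_selected] + rng.sample(benign_pool, N - n_selected)
    rng.shuffle(mixed)  # avoid ordering artifacts during fine-tuning
    return mixed
```

For example, with N = 100 and α = 0.2, the resulting set contains 20 attacker-selected samples and 80 random benign samples.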
Hardware Specification | Yes | All experiments are conducted on a server equipped with 2 A100 (80GB) GPUs.
Software Dependencies | No | The paper mentions using models like Llama2-7B-Chat, Qwen-2-7B-Instruct, Gemma-2-9B-IT, Ministral-8B-Instruct, Llama-3-8B-Chat, GPT-4, and tools like Perspective API and OpenAI Moderation API, and references LLaMA-Factory for recommendations. However, it does not specify version numbers for any key software components or libraries.
Experiment Setup | Yes | For the preliminary experiment, we fine-tune the LLMs over the 100 samples filtered from the benign dataset, with a learning rate of 5e-5, a batch size of 20, and 5 fine-tuning epochs. ... we use a learning rate of 2e-5 for Ministral-8B-Instruct-2410, Qwen2-7B-Instruct, and Llama-3-8B-Chat following recommendations in LLaMA-Factory. Unless otherwise specified, the number of fine-tuning samples is set to 100, and the number of epochs is set to 5. For models smaller than 7B, we use a batch size of 20 per device, while for models larger than 7B, the batch size is reduced to 10 per device due to GPU memory limitations. ... We implement LoRA with a learning rate of 2e-3, a batch size of 10, and 10 fine-tuning epochs following (Qi et al., 2023).
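The hyperparameters quoted in this row can be collected into plain config dicts for reference. Key names follow common trainer conventions and are assumptions; the paper's actual LLaMA-Factory configs are not reproduced here, and the 7B boundary case is an interpretation (Llama2-7B is fine-tuned with batch size 20 in the preliminary experiment, so models of exactly 7B are placed in the larger-batch bucket):

```python
# Preliminary full fine-tuning: 100 filtered benign samples.
PRELIM_FT = {"learning_rate": 5e-5, "per_device_batch_size": 20, "epochs": 5}

# Models where LLaMA-Factory's recommended 2e-5 learning rate is used instead.
LR_2E5_MODELS = {"Ministral-8B-Instruct-2410", "Qwen2-7B-Instruct", "Llama-3-8B-Chat"}

def batch_size_for(model_params_b: float) -> int:
    """Per-device batch size: 20 up to 7B parameters, 10 above (GPU memory)."""
    return 20 if model_params_b <= 7 else 10

# LoRA setting, following (Qi et al., 2023).
LORA_FT = {"learning_rate": 2e-3, "per_device_batch_size": 10, "epochs": 10}
```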