Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety

Authors: Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards.
Researcher Affiliation | Academia | 1 University of Virginia; 2 University of Georgia. Correspondence to: Anil Vullikanti <EMAIL>, Sheng Li <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical equations and text, but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter/.
Open Datasets | Yes | Specifically, we prompt the model with harmful queries from HEx-PHI (Qi et al., 2023), which consists of 330 samples across 11 categories. ... We use Dolly (Conover et al., 2023) and Alpaca (Taori et al., 2023) as the benign datasets for further fine-tuning. ... (2) 2000 random samples from the Asclepius dataset (Kweon et al., 2024), which is a popular clinical QA dataset. ... Bianchi (Bianchi et al., 2023). The Bianchi dataset contains 2483 safety samples created by (Bianchi et al., 2023). ... BeaverTails (Ji et al., 2024). The BeaverTails dataset is a conceptual dataset designed to study and evaluate the risks of fine-tuning large language models (LLMs) with adversarial or harmful examples.
Dataset Splits | Yes | Our proposed method is to filter the top-k (k = 100) samples D_s with the highest Self-Inf scores as follows, ... To simulate the attack, we evaluate two types of continuous fine-tuning datasets: (1) in-distribution task-specific samples drawn from the same distribution as the attacker's selected samples, and (2) out-of-distribution task-specific samples whose distributions are different from those used by the attacker. ... we continue to fine-tune the model with (1) 2000 random samples from the Dolly dataset and (2) 2000 random samples from the Asclepius dataset (Kweon et al., 2024) ... we define a poisoning ratio α as α = (# mixed benign samples) / (total # samples). ... the final fine-tuning dataset is composed of: (1) Nα samples selected by the Self-Inf-N method and (2) N(1 − α) benign samples randomly selected from the original dataset.
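The dataset-mixing step quoted above can be sketched as follows. This is a minimal illustration assuming the Self-Inf scoring has already been run (the function names and the pre-sorted `selected` list are assumptions; the paper's actual scoring and filtering code is in its repository):

```python
import random

def compose_finetuning_set(selected, benign_pool, N, alpha, seed=0):
    """Mix attacker-selected samples with random benign samples.

    Per the quoted definition, alpha = (# mixed benign samples) / (total
    # samples), so the final set of size N holds round(N * alpha) samples
    taken from `selected` (assumed sorted by Self-Inf score, highest
    first) and the remaining N * (1 - alpha) samples drawn uniformly at
    random from the untouched benign pool.
    """
    rng = random.Random(seed)
    n_selected = round(N * alpha)
    mixed = selected[:n_selected] + rng.sample(benign_pool, N - n_selected)
    rng.shuffle(mixed)  # avoid ordering artifacts during fine-tuning
    return mixed
```

For example, with N = 100 and α = 0.2, the resulting set contains 20 attacker-selected samples and 80 random benign samples.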
Hardware Specification | Yes | All experiments are conducted on a server equipped with 2 A100 (80GB) GPUs.
Software Dependencies | No | The paper mentions using models like Llama2-7B-Chat, Qwen-2-7B-Instruct, Gemma-2-9B-IT, Ministral-8B-Instruct, Llama-3-8B-Chat, GPT-4, and tools like Perspective API and OpenAI Moderation API, and references LLaMA-Factory for recommendations. However, it does not specify version numbers for any key software components or libraries.
Experiment Setup | Yes | For the preliminary experiment, we fine-tune the LLMs over the 100 samples filtered from the benign dataset, with a learning rate of 5e-5, a batch size of 20, and 5 fine-tuning epochs. ... we use a learning rate of 2e-5 for Ministral-8B-Instruct-2410, Qwen2-7B-Instruct, and Llama-3-8B-Chat following recommendations in LLaMA-Factory. Unless otherwise specified, the number of fine-tuning samples is set to 100, and the number of epochs is set to 5. For models smaller than 7B, we use a batch size of 20 per device, while for models larger than 7B, the batch size is reduced to 10 per device due to GPU memory limitations. ... We implement LoRA with a learning rate of 2e-3, a batch size of 10, and 10 fine-tuning epochs following (Qi et al., 2023).
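The hyperparameters quoted in this row can be collected into plain config dicts for reference. Key names follow common trainer conventions and are assumptions; the paper's actual LLaMA-Factory configs are not reproduced here, and the 7B boundary case is an interpretation (Llama2-7B is fine-tuned with batch size 20 in the preliminary experiment, so models of exactly 7B are placed in the larger-batch bucket):

```python
# Preliminary full fine-tuning: 100 filtered benign samples.
PRELIM_FT = {"learning_rate": 5e-5, "per_device_batch_size": 20, "epochs": 5}

# Models where LLaMA-Factory's recommended 2e-5 learning rate is used instead.
LR_2E5_MODELS = {"Ministral-8B-Instruct-2410", "Qwen2-7B-Instruct", "Llama-3-8B-Chat"}

def batch_size_for(model_params_b: float) -> int:
    """Per-device batch size: 20 up to 7B parameters, 10 above (GPU memory)."""
    return 20 if model_params_b <= 7 else 10

# LoRA setting, following (Qi et al., 2023).
LORA_FT = {"learning_rate": 2e-3, "per_device_batch_size": 10, "epochs": 10}
```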