Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

Authors: Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected." (Section 5, Experiments)
Researcher Affiliation | Collaboration | (1) Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Department of CSE, SUSTech; (2) Department of CSE, HKUST; (3) Department of CSE, PolyU; (4) Huawei Technologies Co.; (5) CFAR, IHPC, A*STAR; (6) College of Computing and Data Science, NTU; (7) Department of CSE, SUSTech.
Pseudocode | No | The paper describes the method in Section 4, "The Safe Delta Method", and provides an overview in Figure 3, but it does not include a formally structured pseudocode or algorithm block.
Open Source Code | Yes | "We provide the open-source code at https://github.com/ColinLu50/SafeDelta"
Open Datasets | Yes | "To simulate harmful fine-tuning aimed at jailbreaking LLMs, we use the Pure Bad and Identity Shift datasets introduced by Qi et al. (2024). ...sampled from the Alpaca dataset (Taori et al., 2023). ...1,000 samples from the SamSum dataset (Gliwa et al., 2019)... training set of the GSM8k dataset (Cobbe et al., 2021). ...BeaverTails (Ji et al., 2023)... AdvBench (Zou et al., 2023)."
Dataset Splits | Yes | "Each dataset includes 100 examples... The Dirty Summary dataset is created by sampling 1,000 samples from the SamSum dataset... For the clean dataset, we use the training set of the GSM8k dataset... For Summary utility evaluation, we randomly sample 200 test examples from the SamSum dataset... For the evaluation of math reasoning ability, we sample 1,000 test examples from the GSM8k test set."
Hardware Specification | Yes | "All experiments were conducted on a 7B model using a single A100-80G GPU with results averaged over five trials."
Software Dependencies | No | The paper mentions using the AdamW optimizer and refers to an "official fine-tuning implementation", but it does not specify software libraries or frameworks with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup | Yes | "For the Pure Bad dataset and Identity Shift dataset, we set the learning rate to 5 × 10^-5, batch size to 10, and run 5 epochs. For the Dirty Summary dataset, we set the learning rate to 2 × 10^-5, batch size to 32, and run 3 epochs. For the Math dataset, we set the learning rate to 2 × 10^-5, batch size to 32, and run 1 epoch. For Safe Delta, we set s = 0.1 for the safety degradation constraint. We use 512 safe examples for Hessian matrix computation in preparation."
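For readers attempting reproduction, the quoted hyperparameters can be collected into a small configuration sketch. This is a minimal illustration only: the dictionary keys, dataset identifiers, and the `get_config` helper are hypothetical and are not taken from the authors' released code.

```python
# Hypothetical summary of the fine-tuning settings quoted above.
# All names below are illustrative; only the numeric values come from the paper.

FINE_TUNING_CONFIGS = {
    "pure_bad":       {"learning_rate": 5e-5, "batch_size": 10, "epochs": 5},
    "identity_shift": {"learning_rate": 5e-5, "batch_size": 10, "epochs": 5},
    "dirty_summary":  {"learning_rate": 2e-5, "batch_size": 32, "epochs": 3},
    "math":           {"learning_rate": 2e-5, "batch_size": 32, "epochs": 1},
}

SAFE_DELTA_CONFIG = {
    "safety_degradation_constraint_s": 0.1,  # s = 0.1 in the paper
    "num_safe_examples_for_hessian": 512,    # used in the preparation stage
}

def get_config(dataset: str) -> dict:
    """Merge the per-dataset fine-tuning settings with the Safe Delta settings."""
    cfg = dict(FINE_TUNING_CONFIGS[dataset])
    cfg.update(SAFE_DELTA_CONFIG)
    return cfg
```

Keeping the Safe Delta settings separate from the per-dataset settings mirrors the paper's framing: the method's own parameters (s and the number of safe examples) are held fixed across all four fine-tuning datasets.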