Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

Authors: Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected." (Section 5, Experiments)
Researcher Affiliation | Collaboration | (1) Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Department of CSE, SUSTech; (2) Department of CSE, HKUST; (3) Department of CSE, PolyU; (4) Huawei Technologies Co.; (5) CFAR, IHPC, A*STAR; (6) College of Computing and Data Science, NTU; (7) Department of CSE, SUSTech.
Pseudocode | No | The paper describes the method in Section 4, "The Safe Delta Method", and provides an overview in Figure 3, but it does not include a formally structured pseudocode or algorithm block.
Open Source Code | Yes | "We provide the open-source code at https://github.com/ColinLu50/SafeDelta"
Open Datasets | Yes | "To simulate harmful fine-tuning aimed at jailbreaking LLMs, we use the Pure Bad and Identity Shift datasets introduced by Qi et al. (2024). ...sampled from the Alpaca dataset (Taori et al., 2023). ...1,000 samples from the SamSum dataset (Gliwa et al., 2019)... training set of the GSM8k dataset (Cobbe et al., 2021). ...BeaverTails (Ji et al., 2023)... AdvBench (Zou et al., 2023)."
Dataset Splits | Yes | "Each dataset includes 100 examples... The Dirty Summary dataset is created by sampling 1,000 samples from the SamSum dataset... For the clean dataset, we use the training set of the GSM8k dataset... For Summary utility evaluation, we randomly sample 200 test examples from the SamSum dataset... For the evaluation of math reasoning ability, we sample 1,000 test examples from the GSM8k test set."
Hardware Specification | Yes | "All experiments were conducted on a 7B model using a single A100-80G GPU with results averaged over five trials."
Software Dependencies | No | The paper mentions using the AdamW optimizer and refers to an "official fine-tuning implementation", but it does not specify software libraries or frameworks with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup | Yes | "For the Pure Bad dataset and Identity Shift dataset, we set the learning rate to 5 × 10^-5, batch size to 10, and run 5 epochs. For the Dirty Summary dataset, we set the learning rate to 2 × 10^-5, batch size to 32, and run 3 epochs. For the Math dataset, we set the learning rate to 2 × 10^-5, batch size to 32, and run 1 epoch. For Safe Delta, we set s = 0.1 for the safety degradation constraint. We use 512 safe examples for Hessian matrix computation in preparation."
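For readers attempting reproduction, the quoted hyperparameters can be collected into a small configuration sketch. This is a minimal illustration only: the dictionary keys, dataset identifiers, and the `get_config` helper are hypothetical and are not taken from the authors' released code.

```python
# Hypothetical summary of the fine-tuning settings quoted above.
# All names below are illustrative; only the numeric values come from the paper.

FINE_TUNING_CONFIGS = {
    "pure_bad":       {"learning_rate": 5e-5, "batch_size": 10, "epochs": 5},
    "identity_shift": {"learning_rate": 5e-5, "batch_size": 10, "epochs": 5},
    "dirty_summary":  {"learning_rate": 2e-5, "batch_size": 32, "epochs": 3},
    "math":           {"learning_rate": 2e-5, "batch_size": 32, "epochs": 1},
}

SAFE_DELTA_CONFIG = {
    "safety_degradation_constraint_s": 0.1,  # s = 0.1 in the paper
    "num_safe_examples_for_hessian": 512,    # used in the preparation stage
}

def get_config(dataset: str) -> dict:
    """Merge the per-dataset fine-tuning settings with the Safe Delta settings."""
    cfg = dict(FINE_TUNING_CONFIGS[dataset])
    cfg.update(SAFE_DELTA_CONFIG)
    return cfg
```

Keeping the Safe Delta settings separate from the per-dataset settings mirrors the paper's framing: the method's own parameters (s and the number of safe examples) are held fixed across all four fine-tuning datasets.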